System and method for risk sensitive reinforcement learning architecture

ABSTRACT

A computer-implemented system and method for training an automated agent are disclosed. An example system includes: a communication interface; at least one processor; memory in communication with said at least one processor; software code stored in said memory, which when executed causes said system to: instantiate an automated agent that maintains a reinforcement learning neural network and generates, according to outputs of said reinforcement learning neural network, signals for communicating task requests; receive a plurality of states and a plurality of actions for the automated agent; initialize a learning table Q for the automated agent based on the plurality of states and the plurality of actions; compute a plurality of updated learning tables based on the initialized learning table Q using a utility function, the utility function comprising a monotonically increasing concave function; and generate an averaged learning table Q′ based on the plurality of updated learning tables.

CROSS-REFERENCE TO RELATED APPLICATION

This application claims the benefit of and priority to U.S. provisional patent application No. 63/209,615, filed on Jun. 11, 2021, the entire content of which is herein incorporated by reference.

FIELD

The present disclosure generally relates to the field of computer processing, and in particular to reinforcement learning architectures in machine learning.

BACKGROUND

Reinforcement learning can be used to train and deploy computerized agents (hereinafter simply “agents”) in trading markets; however, such application may carry fundamental challenges such as high variance and costly exploration. Moreover, markets are inherently a multi-agent domain having many actors taking actions and changing the environment. To tackle these types of scenarios, agents need to exhibit certain characteristics such as risk-awareness, robustness to perturbations and low learning variance.

SUMMARY

According to an aspect, there is provided a computer-implemented system for training an automated agent. The system includes a communication interface; at least one processor; memory in communication with the at least one processor; software code stored in the memory, which when executed at the at least one processor causes the system to: instantiate an automated agent that maintains a reinforcement learning neural network and generates, according to outputs of the reinforcement learning neural network, signals for communicating task requests; receive, by way of the communication interface, a plurality of training input data including a plurality of states and a plurality of actions for the automated agent; initialize a learning table Q for the automated agent based on the plurality of states and the plurality of actions; compute a plurality of updated learning tables based on the initialized learning table Q using a utility function, the utility function comprising a monotonically increasing concave function; and generate an averaged learning table Q′ based on the plurality of updated learning tables.

In some embodiments, the automated agent is configured to select an action based on the averaged learning table Q′ for communicating one or more task requests.

In some embodiments, the utility function is represented by u(x)=−e^(βx), β<0.

In some embodiments, computing a plurality of updated learning tables may include: receiving, by way of the communication interface, an input parameter k and a training step parameter T; and for each training step t, where t=1, 2 . . . T: computing an interim learning table Q̂ based on the initialized learning table Q; selecting an action a_(t) from the plurality of actions based on the interim learning table Q̂ and a given state s_(t) from the plurality of states; computing a reward r_(t) and a next state s_(t+1) based on the selected action a_(t); and for at least two values of i, where i=1, 2, . . . , k, computing a respective updated learning table Q^(i) of the plurality of updated learning tables based on (s_(t), a_(t), r_(t), s_(t+1)) and the utility function.

In some embodiments, the averaged learning table Q′ is computed as

$\frac{1}{k}{\sum_{i = 1}^{k}{Q^{i}.}}$

In some embodiments, the utility function is a first utility function and the software code, when executed at the at least one processor, further causes the system to: instantiate an adversarial agent that maintains an adversarial reinforcement learning neural network and generates, according to outputs of the adversarial reinforcement learning neural network, signals for communicating adversarial task requests; initialize an adversarial learning table Q_(A) for the adversarial agent; compute a plurality of updated adversarial learning tables based on the initialized adversarial learning table Q_(A) using a second utility function, the second utility function comprising a monotonically increasing convex function; and generate an averaged adversarial learning table Q_(A)′ based on the plurality of updated adversarial learning tables.

In some embodiments, the adversarial agent is configured to select an adversarial action based on the averaged adversarial learning table Q_(A)′ to minimize a reward for the automated agent.

In some embodiments, the second utility function is represented by u^(A)(x)=−e^(β^(A)x), β^(A)>0.

In some embodiments, computing a plurality of updated adversarial learning tables may include: receiving, by way of the communication interface, an input parameter k and a training step parameter T; and for each training step t, where t=1, 2 . . . T: computing an interim adversarial learning table Q̂_(A) based on the initialized adversarial learning table Q_(A); selecting an adversarial action a_(t)^(A) based on the interim adversarial learning table Q̂_(A) and a given state s_(t) from the plurality of states; computing an adversarial reward r_(t)^(A) and a next state s_(t+1) based on the selected adversarial action a_(t)^(A); and for at least two values of i, where i=1, 2, . . . , k, computing a respective updated adversarial learning table Q^(i)_(A) of the plurality of updated adversarial learning tables based on (s_(t), a_(t)^(A), r_(t)^(A), s_(t+1)) and the second utility function.

In some embodiments, the averaged adversarial learning table Q_(A)′ is computed as

$\frac{1}{k}{\sum_{i = 1}^{k}{Q_{A}^{i}.}}$

According to another aspect, there is provided a computer-implemented method of training an automated agent, the method including: instantiating an automated agent that maintains a reinforcement learning neural network and generates, according to outputs of the reinforcement learning neural network, signals for communicating task requests; receiving, by way of the communication interface, a plurality of training input data including a plurality of states and a plurality of actions for the automated agent; initializing a learning table Q for the automated agent based on the plurality of states and the plurality of actions; computing a plurality of updated learning tables based on the initialized learning table Q using a utility function, the utility function comprising a monotonically increasing concave function; and generating an averaged learning table Q′ based on the plurality of updated learning tables.

In some embodiments, the method may further include: selecting an action, by the automated agent, based on the averaged learning table Q′ for communicating one or more task requests.

In some embodiments, the utility function is represented by u(x)=−e^(βx), β<0.

In some embodiments, computing a plurality of updated learning tables may include: receiving, by way of the communication interface, an input parameter k and a training step parameter T; and for each training step t, where t=1, 2 . . . T: computing an interim learning table Q̂ based on the initialized learning table Q; selecting an action a_(t) from the plurality of actions based on the interim learning table Q̂ and a given state s_(t) from the plurality of states; computing a reward r_(t) and a next state s_(t+1) based on the selected action a_(t); and for at least two values of i, where i=1, 2, . . . , k, computing a respective updated learning table Q^(i) of the plurality of updated learning tables based on (s_(t), a_(t), r_(t), s_(t+1)) and the utility function.

In some embodiments, the averaged learning table Q′ is computed as

$\frac{1}{k}{\sum_{i = 1}^{k}{Q^{i}.}}$

In some embodiments, the utility function is a first utility function and the method may further include: instantiating an adversarial agent that maintains an adversarial reinforcement learning neural network and generates, according to outputs of the adversarial reinforcement learning neural network, signals for communicating adversarial task requests; initializing an adversarial learning table Q_(A) for the adversarial agent; computing a plurality of updated adversarial learning tables based on the initialized adversarial learning table Q_(A) using a second utility function, the second utility function comprising a monotonically increasing convex function; and generating an averaged adversarial learning table Q_(A)′ based on the plurality of updated adversarial learning tables.

In some embodiments, the method may further include selecting an adversarial action by the adversarial agent based on the averaged adversarial learning table Q_(A)′ to minimize a reward for the automated agent.

In some embodiments, the second utility function is represented by u^(A)(x)=−e^(β^(A)x), β^(A)>0.

In some embodiments, computing a plurality of updated adversarial learning tables may include: receiving, by way of the communication interface, an input parameter k and a training step parameter T; and for each training step t, where t=1, 2 . . . T: computing an interim adversarial learning table Q̂_(A) based on the initialized adversarial learning table Q_(A); selecting an adversarial action a_(t)^(A) based on the interim adversarial learning table Q̂_(A) and a given state s_(t) from the plurality of states; computing an adversarial reward r_(t)^(A) and a next state s_(t+1) based on the selected adversarial action a_(t)^(A); and for at least two values of i, where i=1, 2, . . . , k, computing a respective updated adversarial learning table Q^(i)_(A) of the plurality of updated adversarial learning tables based on (s_(t), a_(t)^(A), r_(t)^(A), s_(t+1)) and the second utility function.

In some embodiments, the averaged adversarial learning table Q_(A)′ is computed as

$\frac{1}{k}{\sum_{i = 1}^{k}{Q_{A}^{i}.}}$

Other features will become apparent from the drawings in conjunction with the following description.

BRIEF DESCRIPTION OF DRAWINGS

In the figures, embodiments are illustrated by way of example. It is to be expressly understood that the description and figures are only for the purpose of illustration and as an aid to understanding.

Embodiments will now be described, by way of example only, with reference to the attached figures, wherein in the figures:

FIG. 1A is an example schematic diagram of a training system, in accordance with an embodiment.

FIG. 1B is a schematic diagram of an automated agent, in accordance with an embodiment.

FIG. 1C is a schematic diagram of an example neural network maintained at the computer-implemented system of FIG. 1A.

FIG. 2 is an example screen from a lunar lander game, in accordance with an embodiment.

FIG. 3 shows an example risk-averse averaged Q-learning algorithm, in accordance with an embodiment.

FIG. 4 shows an example variance reduced risk-averse Q-learning algorithm, in accordance with an embodiment.

FIG. 5 shows an example risk-averse multi-agent Q-learning algorithm, in accordance with an embodiment.

FIG. 6 shows an example risk-averse adversarial averaged Q-learning algorithm, in accordance with an embodiment.

FIG. 7 shows a directional field plot and a trajectory plot of several RL algorithms, in accordance with an embodiment.

FIG. 8 shows an example risk-averse Q-learning algorithm, in accordance with an embodiment.

FIG. 9 shows an example Nash Q-learning algorithm for an agent, in accordance with an embodiment.

FIGS. 10A and 10B show an example risk-averse adversarial Q-learning algorithm, in accordance with an embodiment.

FIG. 11A shows a directional field plot of a meta-game payoff, in accordance with an embodiment.

FIG. 11B shows a trajectory plot of the meta-game payoff in FIG. 11A, in accordance with an embodiment.

FIG. 12A shows a directional field plot of another meta-game payoff, in accordance with an embodiment.

FIG. 12B shows a trajectory plot of the meta-game payoff in FIG. 12A, in accordance with an embodiment.

FIG. 13 is a schematic diagram of a computing device that implements a training system, in accordance with an embodiment.

DETAILED DESCRIPTION

Reinforcement learning (RL) is a type of machine learning technique that enables an agent to learn in an interactive environment by trial and error, using feedback from its own actions and experiences, in order to maximize a reward. RL has been applied to a number of fields such as games [5], navigation [4], software engineering [2], industrial design [22], and finance [18]. Each of these applications has inherent difficulties which are long-standing fundamental challenges in RL, such as limited training time, costly exploration and safety considerations, among others.

In particular, in finance, there are some examples of RL in stochastic control problems such as option pricing [19], market making [29], and optimal execution [24]. An example finance application is trading, where the system is configured to implement trained algorithms capable of automatically making trading decisions based on a set of stored rules computed by a machine [31].

In trading, the environment represents the market (and the rest of the actors). The agent's task is to take actions related to how and how much to trade, and the objective is usually to maximize profit while minimizing risk. There are several challenges in this setting, such as partial observability, a large action space, and a difficult definition of rewards and learning objectives [31]. In this disclosure, two properties for learning agents in realistic trading market scenarios are considered and implemented: risk assessment and robustness. In some embodiments, one or more RL algorithms are implemented with risk-averse objective functions and variance reduction techniques. In some embodiments, the RL algorithms are implemented to operate in a multi-agent learning environment, and can assume an adversary which may take over and perturb the learning process. These RL algorithms are developed to balance theoretical guarantees with practical use. Additionally, an empirical game theory analysis for multi-agent learning that considers risk-averse payoffs is performed and discussed herein.

Risk assessment is a cornerstone in financial applications. One approach is to consider risk while assessing the performance (profit) of a trading strategy. Here, risk is a quantity related to the variance (or standard deviation) of the profit and it is commonly referred to as “volatility”. In particular, the Sharpe ratio [27] considers both the generated profit and the risk (variance) associated with a trading strategy. This objective function (the Sharpe ratio) is different from traditional RL, where the goal is to optimize the expected return, usually without considerations of risk. There are works that propose risk-sensitive RL algorithms [21, 12] and variance reduction techniques [1]. The RL algorithms discussed in this disclosure improve upon these works by further reducing variance, for example, through the combination of using a utility function in updating multiple Q-tables in a Q-learning environment, while also having convergence guarantees and improved robustness via adversarial learning.
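
To make the risk-adjusted objective concrete, the following minimal Python sketch (a hypothetical helper, not part of the disclosed system) computes a Sharpe-style ratio from per-period profits, assuming a zero risk-free rate:

    import numpy as np

    def sharpe_ratio(profits, eps=1e-8):
        """Mean per-period profit divided by its standard deviation (volatility).

        A zero risk-free rate is assumed; eps guards against division by zero.
        """
        profits = np.asarray(profits, dtype=float)
        return profits.mean() / (profits.std() + eps)

    # Two hypothetical strategies with the same mean profit but different volatility:
    steady = [1.0, 1.1, 0.9, 1.0, 1.05]
    volatile = [3.0, -1.0, 2.5, -0.5, 1.05]
    print(sharpe_ratio(steady))    # larger: same mean profit, low variance
    print(sharpe_ratio(volatile))  # smaller: same mean profit, high variance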

Deep RL has been shown to be brittle in many scenarios [15]; therefore, improving robustness is important for deploying agents in realistic scenarios, such as for use in trading platforms. A line of work has improved robustness of RL agents via adversarial perturbations [23, 26]. For example, the learning framework or system may assume an adversary (who is also learning) is allowed to take over control at regular intervals. This approach has shown good experimental results in robotics [25].

A trading market can be seen as a multi-agent interaction environment. Therefore, the agents in the RL algorithms may be evaluated from the perspective of game theory. However, it may be too difficult to analyze in a standard game-theoretic framework since there is no normal form representation (commonly used to analyze games). Fortunately, empirical game theory [35, 38] overcomes this limitation by using the information of several rounds of repeated interactions and assuming a higher level of strategies (agents' policies). These modifications have made possible the analysis of multi-agent interactions in complex scenarios such as markets [7], and multi-agent games [33]. However, these works have not studied the interactions under risk metrics (such as the Sharpe ratio), which are explored in this disclosure.

In summary, the RL algorithms disclosed, in some embodiments, combine risk-awareness, variance reduction and robustness techniques. For example, a Risk-Averse Averaged Q-Learning (e.g., RA2-Q shown in FIG. 3) and a Variance Reduced Risk-Averse Q-Learning (e.g., RA2.1-Q shown in FIG. 4) use risk-averse functions and variance reduction techniques. Then, the training framework is further configured to simulate a multi-agent scenario in which an adversary can perturb the learning process, such as, for example, a Risk-Averse Multi-Agent Q-Learning (e.g., RAM-Q in FIG. 5), which is a multi-agent version of adversarial learning with strong assumptions and theoretical guarantees. A Risk-Averse Adversarial Averaged Q-Learning (e.g., RA3-Q in FIG. 6) relaxes those assumptions and proposes a more practical algorithm that keeps the multi-agent adversarial component to improve robustness. A theoretical result is presented using empirical game theory analysis on games with risk-sensitive payoffs.

A computer system is described next in which the various RL algorithms may be implemented to train one or more automated agents. FIG. 1A is a high-level schematic diagram of an example computer-implemented system 100 for training an automated agent having a neural network, exemplary of embodiments. The automated agent can be instantiated and trained by system 100 in manners disclosed herein to generate task requests.

As detailed herein, in some embodiments, system 100 includes features adapting it to perform certain specialized purposes, e.g., to function as a trading platform. In such embodiments, system 100 may be referred to as trading platform 100 or simply as platform 100 for convenience. In such embodiments, the automated agent may generate requests for tasks to be performed in relation to securities (e.g., stocks, bonds, options or other negotiable financial instruments). For example, the automated agent may generate requests to trade (e.g., buy and/or sell) securities by way of a trading venue.

Referring now to the embodiment depicted in FIG. 1A, trading platform 100 has data storage 120 storing a model for a reinforcement learning neural network. The model is used by trading platform 100 to instantiate one or more automated agents 180 (e.g., FIG. 1B) that each maintains a reinforcement learning neural network 110 (which may be referred to as a reinforcement learning network 110 or network 110 for convenience).

A processor 104 is configured to execute machine-executable instructions to train a reinforcement learning network 110 through a training engine 112. The training engine can be configured to generate signals based on one or more rewards or incentives to train automated agents 180 to perform desired tasks more optimally, e.g., to minimize or maximize certain performance metrics such as risk or variance.

The platform 100 can connect to an interface application 130 installed on a user device to receive input data. Trade entities 150 a, 150 b can interact with the platform to receive output data and provide input data. The trade entities 150 a, 150 b can have at least one computing device. The platform 100 can train one or more reinforcement learning neural networks 110. The trained reinforcement learning networks 110 can be used by platform 100 or can be for transmission to trade entities 150 a, 150 b, in some embodiments. The platform 100 can process trade orders using the reinforcement learning network 110 in response to commands from trade entities 150 a, 150 b, in some embodiments.

The platform 100 can connect to different data sources 160 and databases 170 to receive input data and receive output data for storage. The input data can represent trade orders. Network 140 (or multiple networks) is capable of carrying data and can involve wired connections, wireless connections, or a combination thereof. Network 140 may involve different network communication technologies, standards and protocols, for example.

The platform 100 can include an I/O unit 102, a processor 104, communication interface 106, and data storage 120. The I/O unit 102 can enable the platform 100 to interconnect with one or more input devices, such as a keyboard, mouse, camera, touch screen and a microphone, and/or with one or more output devices such as a display screen and a speaker.

The processor 104 can execute instructions in memory 108 to implement aspects of processes described herein. The processor 104 can execute instructions in memory 108 to configure a data collection unit, interface unit (to provide control commands to interface application 130), reinforcement learning network 110, training engine 112, and other functions described herein. The processor 104 can be, for example, any type of general-purpose microprocessor or microcontroller, a digital signal processing (DSP) processor, an integrated circuit, a field programmable gate array (FPGA), a reconfigurable processor, or any combination thereof.

As depicted in FIG. 1B, automated agent 180 receives input data 185 (e.g., from one or more data sources 160 or via a data collection unit) and generates output signal 188 according to its reinforcement learning network 110. In some embodiments, the output signal 188 is transmitted to trade entities 150 a, 150 b for execution of a task. Reinforcement learning network 110 can refer to a neural network that implements reinforcement learning.

FIG. 1C is a schematic diagram of an example neural network 200 according to some embodiments. The example neural network 200 can include an input layer, a hidden layer, and an output layer. The neural network 200 processes input data using its layers based on reinforcement learning, for example.

Reinforcement learning is a category of machine learning that configures agents, such as the automated agents 180 described herein, to take actions in an environment to maximize a notion of a reward, and in some embodiments, the reward can be maximized by minimizing risks or variances. The processor 104 is configured with machine executable instructions to instantiate an automated agent 180 that maintains a reinforcement learning neural network 110 (also referred to as a reinforcement learning network 110 for convenience), and to train the reinforcement learning network 110 of the automated agent 180 using a training engine 112. The processor 104 is configured to control the reinforcement learning network 110 to process input data in order to generate output signals. Input data may include trade orders, various feedback data (e.g., rewards), feature selection data, data reflective of completed tasks (e.g., executed trades), data reflective of trading schedules, etc. Output signals may include signals for communicating resource task requests, e.g., a request to trade in a certain security. For convenience, a good signal may be referred to as a “positive reward” or simply as a reward, and a bad signal may be referred to as a “negative reward” or as a “punishment”.
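
The interaction just described can be summarized by a standard agent-environment loop. The following Python sketch is illustrative only and assumes hypothetical agent and environment objects exposing act, learn, reset and step methods; it is not the interface of platform 100.

    def run_episode(env, agent, max_steps=1000):
        """One training episode: the agent emits actions (output signals) and
        learns from the rewards (positive or negative) fed back by the environment."""
        state = env.reset()
        total_reward = 0.0
        for _ in range(max_steps):
            action = agent.act(state)                        # e.g., a task request
            next_state, reward, done = env.step(action)      # feedback data
            agent.learn(state, action, reward, next_state)   # reinforcement signal
            total_reward += reward
            state = next_state
            if done:
                break
        return total_reward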

Referring again to FIG. 1A, the interface application 130 interacts with the trading platform 100 to exchange data (including control commands) and generates visual elements for display at a user device. The visual elements can represent reinforcement learning networks 110 and output generated by reinforcement learning networks 110.

Memory 108 may include a suitable combination of any type of computer memory that is located either internally or externally such as, for example, random-access memory (RAM), read-only memory (ROM), compact disc read-only memory (CDROM), electro-optical memory, magneto-optical memory, erasable programmable read-only memory (EPROM), and electrically-erasable programmable read-only memory (EEPROM), Ferroelectric RAM (FRAM) or the like. Data storage devices 120 can include memory 108, databases 122, and persistent storage 124.

The communication interface 106 can enable the platform 100 to communicate with other components, to exchange data with other components, to access and connect to network resources, to serve applications, and perform other computing applications by connecting to a network (or multiple networks) capable of carrying data including the Internet, Ethernet, plain old telephone service (POTS) line, public switch telephone network (PSTN), integrated services digital network (ISDN), digital subscriber line (DSL), coaxial cable, fiber optics, satellite, mobile, wireless (e.g. Wi-Fi, WiMAX), SS7 signaling network, fixed line, local area network, wide area network, and others, including any combination of these.

The platform 100 can be operable to register and authenticate users (using a login, unique identifier, and password for example) prior to providing access to applications, a local network, network resources, other networks and network security devices. The platform 100 may serve multiple users which may operate trade entities 150 a, 150 b.

The data storage 120 may be configured to store information associated with or created by the components in memory 108 and may also include machine executable instructions. The data storage 120 includes a persistent storage 124 which may involve various types of storage technologies, such as solid state drives, hard disk drives, flash memory, and may be stored in various formats, such as relational databases, non-relational databases, flat files, spreadsheets, extended markup files, etc.

Other Practical Applications

As shown in FIG. 1B, automated agent 180 receives input data 185 (e.g., from one or more data sources 160 or via a data collection unit) and generates output signal 188 according to its reinforcement learning network 110. In some embodiments, the output signal 188 can be transmitted to another system, such as a control system, for executing one or more commands represented by the output signal 188.

In some embodiments, once the reinforcement learning network 110 has been trained, it generates output signal 188 reflective of its decisions to take particular actions in response to input data. Input data can include, for example, a set of data obtained from one or more data sources 160, which may be stored in databases 170 in real time or near real time.

As a practical example, consider an HVAC control system which may be configured to set and control heating, ventilation, and air conditioning (HVAC) units for a building. In order to efficiently manage the power consumption of the HVAC units, the control system may receive sensor data representative of temperature data in a historical period. The control system may be implemented to use an automated agent 180 and a trained reinforcement learning network 110 to generate an output signal 188, which may be a resource request command signal 188 indicative of a set value or set point representing a most optimal room temperature based on the sensor data, which may be part of input data 185, representative of the temperature data at present and in a historical period (e.g., the past 72 hours or the past week).

The input data 185 may include time series data that is gathered from sensors 160 placed at various points of the building. The measurements from the sensors 160, which form the time series data, may be discrete in nature. For example, the time series data may include a first data value 21.5 degrees representing the detected room temperature in Celsius at time t₁, a second data value 23.3 degrees representing the detected room temperature in Celsius at time t₂, a third data value 23.6 degrees representing the detected room temperature in Celsius at time t₃, and so on.

Other input data 185 may include a target range of temperature values for the particular room or space and/or a target room temperature or a target energy consumption per hour. A reward may be generated based on the target room temperature range or value, and/or the target energy consumption per hour.

In some examples, one or more automated agents 180 may be implemented, each agent 180 for controlling the room temperature for a separate room or space within the building which the HVAC control system is monitoring.

As another example, in some embodiments, a traffic control system may be configured to set and control traffic flow at an intersection. The traffic control system may receive sensor data representative of detected traffic flows at various points of time in a historical period. The traffic control system may use an automated agent 180 and a trained reinforcement learning network 110 to control a traffic light based on input data representative of the traffic flow data in real time, and/or traffic data in the historical period (e.g., the past 4 or 24 hours).

The input data 185 may include sensor data gathered from one or more data sources 160 (e.g., sensors 160) placed at one or more points close to the traffic intersection. For example, the time series data may include a first data value of 3 vehicles representing the detected number of cars at time t₁, a second data value of 1 vehicle representing the detected number of cars at time t₂, a third data value of 5 vehicles representing the detected number of cars at time t₃, and so on.

Based on a desired traffic flow value at t_(n), the automated agent 180, based on neural network 110, may then generate an output signal 188 to shorten or lengthen a red or green light signal at the intersection, in order to ensure the intersection is least likely to be congested during one or more points in time.

As yet another example, the input data 185 may include a set of measured blood pressure values or blood sugar levels in a time period measured by one or more data sources such as medical devices 160. The trained reinforcement learning network 110 may receive the input data 185 from the sensors 160 or a database 170, and generate an output signal 188 representing a predicted data value representing a future blood pressure value or a future blood sugar level. The output signal 188 representing the predicted data value may be transmitted to a health care professional for monitoring or medical purposes.

In some embodiments, as another example, an automated agent 180 in system 100 may be trained to play a video game, and more specifically, a lunar lander game 300, as shown in FIG. 2. In this game, the goal is to control the lander's two thrusters so that it quickly, but gently, settles on a target landing pad. In this example, input data 185 provided to an automated agent 180 may include, for example, X-position on the screen, Y-position on the screen, altitude (distance between the lander and the ground below it), vertical velocity, horizontal velocity, angle of the lander, whether the lander is touching the ground (Boolean variable), etc.

In some embodiments, the reward may indicate a plurality of objectives including: smoothness of landing, conservation of fuel, time used to land, and distance to a target area on the landing pad. The reward, which may be a reward vector, can be used to train the neural network 110 for landing the lunar lander by the automated agent 180.

Single-Agent Reinforcement Learning

A Markov Decision Process (MDP) is defined by a set of states $\mathcal{S}$ describing the possible configurations, a set of actions $\mathcal{A}$, and a set of observations $\mathcal{O}$ for each agent. A stochastic policy $\pi_{\theta}: \mathcal{O} \times \mathcal{A} \rightarrow \lbrack 0,1 \rbrack$ parameterized by θ produces the next state according to the state transition function $\mathcal{P}: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}$. The agent obtains rewards as a function of the state and the agent's action, $r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$, and receives a private observation correlated with the state, $o: \mathcal{S} \rightarrow \mathcal{O}$. The initial states are determined by a distribution $d_{0}: \mathcal{S} \rightarrow \lbrack 0,1 \rbrack$.

Multi-Agent Reinforcement Learning

In RL, each agent i aims to maximize its own total expected return, e.g., for a Markov game with two agents, for a given initial state distribution d₀, the discounted returns are respectively:

$\begin{matrix}{{J^{1}\left( {d_{0},\pi^{1},\pi^{2}} \right)} = {\sum_{t = 0}^{\infty}{\gamma^{t}{\mathbb{E}}\left\lbrack {r_{t}^{1} \mid \pi^{1},\pi^{2},d_{0}} \right\rbrack}}} & (1)\end{matrix}$

$\begin{matrix}{{J^{2}\left( {d_{0},\pi^{1},\pi^{2}} \right)} = {\sum_{t = 0}^{\infty}{\gamma^{t}{\mathbb{E}}\left\lbrack {r_{t}^{2} \mid \pi^{1},\pi^{2},d_{0}} \right\rbrack}}} & (2)\end{matrix}$

where γ is a discount factor and r_(t)¹, r_(t)², t=1, 2, . . . are respectively the immediate rewards for agent 1 and agent 2. A Nash equilibrium for a Markov game (with two agents) is defined below.

Definition 1 [16] A Nash equilibrium point of the game (J¹, J²) is a pair of strategies (π_(*)¹, π_(*)²) such that for all s ∈ $\mathcal{S}$,

$\begin{matrix}{{J^{1}\left( {s,\pi_{*}^{1},\pi_{*}^{2}} \right)} \geq {J^{1}\left( {s,\pi^{1},\pi_{*}^{2}} \right)}\quad\forall\pi^{1}} & (3)\end{matrix}$

$\begin{matrix}{{J^{2}\left( {s,\pi_{*}^{1},\pi_{*}^{2}} \right)} \geq {J^{2}\left( {s,\pi_{*}^{1},\pi^{2}} \right)}\quad\forall\pi^{2}} & (4)\end{matrix}$

Multi-Agent Extension of MDP

A Markov game for N agents is defined by a set of states $\mathcal{S}$ describing the possible configurations of all agents, a set of actions $\mathcal{A}_{1},\ldots,\mathcal{A}_{N}$ and a set of observations $\mathcal{O}_{1},\ldots,\mathcal{O}_{N}$ for each agent. To choose actions, each agent i uses a stochastic policy $\pi_{\theta_{i}}: \mathcal{O}_{i} \times \mathcal{A}_{i} \rightarrow \lbrack 0,1 \rbrack$ parameterized by θ_(i), which produces the next state according to the state transition function $\mathcal{P}: \mathcal{S} \times \mathcal{A}_{1} \times \ldots \times \mathcal{A}_{N} \rightarrow \mathcal{S}$. Each agent i obtains rewards as a function of the state and the agent's action, $r_{i}: \mathcal{S} \times \mathcal{A}_{i} \rightarrow \mathbb{R}$, and receives a private observation correlated with the state, $o_{i}: \mathcal{S} \rightarrow \mathcal{O}_{i}$. The initial states are determined by a distribution $d_{0}: \mathcal{S} \rightarrow \lbrack 0,1 \rbrack$.

Q-learning can use a Q-table to guide an agent to find the best action. A Q-table can be generated based on the [state, action] pairs available to the agent, and updated with appropriate values after an action is taken by the agent during a training step or episode. This Q-table acts as a reference table for the agent to select the most optimal action based on each value in the table. In multi-agent Q-learning, the Q-tables are defined over joint actions for each of the agents. Each agent receives rewards according to its reward function, with transitions dependent on the actions chosen jointly by the set of agents.
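
For illustration, a minimal tabular Q-learning sketch in Python is shown below. It assumes small discrete state and action spaces and a hypothetical environment with reset and step methods; it is a generic sketch, not the disclosed RA2-Q family of algorithms.

    import numpy as np

    def q_learning(env, n_states, n_actions, episodes=500,
                   alpha=0.1, gamma=0.99, epsilon=0.1):
        Q = np.zeros((n_states, n_actions))   # Q-table: one entry per [state, action] pair
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                # epsilon-greedy: explore occasionally, otherwise pick the best known action
                if np.random.rand() < epsilon:
                    a = np.random.randint(n_actions)
                else:
                    a = int(Q[s].argmax())
                s_next, r, done = env.step(a)
                # temporal-difference update of the visited entry
                Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
                s = s_next
        return Q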

Empirical Game Theory

In some embodiments, the multi-agent behaviours in a trading market can be analyzed using empirical game theory, where a player corresponds to an agent, and a strategy corresponds to a learning algorithm. Then, in a p-player game, players are involved in a single-round strategic interaction. Each player i can be configured to select a strategy π^(i) from a set of k strategies S^(i)={π₁^(i), . . . , π_(k)^(i)} and receive a stochastic payoff $R^{i}\left( {\pi^{1},\ldots,\pi^{p}} \right): S^{1} \times S^{2} \times \ldots \times S^{p} \rightarrow \mathbb{R}$. The underlying game that is usually studied is $r^{i}\left( {\pi^{i},\ldots,\pi^{p}} \right) = {\mathbb{E}\left\lbrack {R^{i}\left( {\pi^{1},\ldots,\pi^{p}} \right)} \right\rbrack}$. In general, the payoff of player i can be denoted as μ^(i), and the joint strategy of all players except for player i can be denoted as x^(−i).

Definition 2 A joint strategy x=(x¹, . . . , x^(p))=(x^(i), x^(−i)) is a Nash equilibrium if for all i:

$\begin{matrix}{{{\mathbb{E}}_{\pi \sim x}\left\lbrack {\mu^{i}(\pi)} \right\rbrack} = {\max\limits_{\pi^{i}}{{\mathbb{E}}_{\pi^{- i} \sim x^{- i}}\left\lbrack {\mu^{i}\left( {\pi^{i},\pi^{- i}} \right)} \right\rbrack}}} & (5)\end{matrix}$

Definition 3 A joint strategy x=(x¹, . . . , x^(p))=(x^(i), x^(−i)) is an ϵ-Nash equilibrium if for all i:

$\begin{matrix}{{{\max\limits_{\pi^{i}}{{\mathbb{E}}_{\pi^{- i} \sim x^{- i}}\left\lbrack {\mu^{i}\left( {\pi^{i},\pi^{- i}} \right)} \right\rbrack}} - {{\mathbb{E}}_{\pi \sim x}\left\lbrack {\mu^{i}(\pi)} \right\rbrack}} \leq \epsilon} & (6)\end{matrix}$

Evolutionary dynamics can be used to analyze multi-agent interactions. An example model is replicator dynamics (RD) [36], which describes how a population evolves through time under evolutionary pressure (in the present disclosure, a population is composed of learning algorithms). RD assumes that reproductive success is determined by interactions and their outcomes. For example, the population of a certain type increases if it has a higher fitness (in the present disclosure, this means the expected return in a certain interaction) than the population average; otherwise that population share will decrease.

To view the dominance of different strategies, it is common to plot the directional field of the payoff tables using the replicator dynamics for a number of strategy profiles x in the simplex strategy space [33].
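
A minimal sketch of replicator dynamics for a symmetric two-strategy game is given below; the payoff matrix and the Euler step size are illustrative assumptions, not values from the disclosure.

    import numpy as np

    def replicator_step(x, A, dt=0.01):
        """One Euler step of replicator dynamics: dx_i/dt = x_i * (f_i - f_bar)."""
        fitness = A @ x            # expected payoff of each strategy against population x
        avg_fitness = x @ fitness  # population-average payoff
        return x + dt * x * (fitness - avg_fitness)

    A = np.array([[3.0, 0.0],
                  [5.0, 1.0]])     # illustrative payoff table (row strategy vs. column strategy)
    x = np.array([0.9, 0.1])       # initial population shares over the two strategies
    for _ in range(2000):
        x = replicator_step(x, A)
    print(x)                        # the strategy with above-average fitness takes over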

The embodiments of RL algorithms shown in FIGS. 3 to 6 are mainly situated in the broad area of safe RL [14]. In some embodiments, the robustness of learned policies can be improved by assuming two opposing learning processes: one that aims to disturb the most and another one that tries to control the perturbations [23]. This approach has been recently adapted to work with neural networks in the context of deep RL [26]. Moreover, Risk-Averse Robust Adversarial Reinforcement Learning (RARL) [25] extended this idea by combining with Averaged DQN [1], an algorithm that proposes averaging the previous k estimates to stabilize the training process. RARL trains two agents, protagonist and adversary, in parallel. The goal for the protagonist agent can be set to maximize the expected return and minimize the variance of the expected return, while the goal for the adversary agent can be set to minimize the expected return and maximize the variance of the expected return. RARL showed good experimental results, but lacked theoretical guarantees and theoretical insights on the variance reduction and robustness. Multi-agent Q-learning [16] is useful for finding the optimal strategy when there exists a unique Nash equilibrium in general-sum stochastic games, and this approach could also be used in adversarial RL.

Wainwright in [34] proposed a variance reduction Q-learning algorithm (V-QL), which can be seen as a variant of the SVRG algorithm in stochastic optimization [17]. Given an algorithm that converges to Q*, one of its iterates $\bar{Q}$ could be used as a proxy for Q*, and the ordinary Q-learning updates are then recentered by a quantity $- {\hat{\mathcal{J}}\left( \bar{Q} \right)} + {\mathcal{J}\left( \bar{Q} \right)}$, where $\hat{\mathcal{J}}$ is an empirical Bellman operator and $\mathcal{J}$ is the population Bellman operator, which is not computable, but an unbiased approximation of it could be used instead. This algorithm is shown to be convergent and enjoys minimax optimality up to a logarithmic factor.

In some embodiments, risk-averse objective functions [21] can be combined with the Q-learning algorithm to reduce variance and risk, as elaborated below.

Risk-Averse Q-Learning

Shen in [28] proposed a Q-learning algorithm that is shown to converge to the optimal of a risk-sensitive objective function. In [28], the training scheme is the same as Q-learning, except that in each iteration, a utility function is applied to a temporal difference (TD) error (see, e.g., Algorithm 5 in FIG. 8, further elaborated in the Discussion section below). Generally speaking, a TD error function reports back the difference between an estimated reward and the actual reward received at any given state. The larger the error, the larger the difference between the expected and actual reward.

In order to optimize the expected return as well as minimize the variance of the expected return, an expected utility of the return can be used as the objective function instead:

$\begin{matrix}{{\overset{\sim}{J}}_{\pi} = {\frac{1}{\beta}{{{\mathbb{E}}_{\pi}\left\lbrack {\exp\left( {\beta{\sum_{t = 0}^{\infty}{\gamma^{t}r_{t}}}} \right)} \right\rbrack}.}}} & (7)\end{matrix}$

By a straightforward Taylor expansion, Eq.(7) above yields:

${{\mathbb{E}}\left\lbrack {\sum_{t = 0}^{\infty}{\gamma^{t}r_{t}}} \right\rbrack} + {\frac{\beta}{2}{{\mathbb{V}ar}\left\lbrack {\sum_{t = 0}^{\infty}{\gamma^{t}r_{t}}} \right\rbrack}} + {O\left( \beta^{2} \right)}$

where, when β<0, the objective function is risk-averse; when β=0, the objective function is risk-neutral; and when β>0, the objective function is risk-seeking.
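
One way to see the expansion above (added here as a sketch for clarity) is through the cumulant generating function of the return $R = {\sum_{t = 0}^{\infty}{\gamma^{t}r_{t}}}$, using the logarithmic form of the objective, which ranks policies identically because the logarithm is a monotone transform:

$\frac{1}{\beta}{\log{{\mathbb{E}}_{\pi}\left\lbrack {\exp\left( {\beta R} \right)} \right\rbrack}} = \frac{1}{\beta}\left( {{\beta \cdot {\mathbb{E}}\left\lbrack R \right\rbrack} + {\frac{\beta^{2}}{2}{{\mathbb{V}ar}\left\lbrack R \right\rbrack}} + {O\left( \beta^{3} \right)}} \right) = {{\mathbb{E}}\left\lbrack R \right\rbrack} + {\frac{\beta}{2}{{\mathbb{V}ar}\left\lbrack R \right\rbrack}} + {O\left( \beta^{2} \right)}$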

By applying a monotonically increasing concave utility function u(x)=−exp(βx), where β<0, to the TD error, Algorithm 5 (see, e.g., FIG. 8) converges to the optimal point of Eq. (7). Hence, it can be shown that:

Theorem 1 (Theorem 3.2, [28]) Running Algorithm 5 from an initial Q table, Q→Q* with probability (w.p.) 1, where Q* is the unique solution to

${{{\mathbb{E}}_{s^{\prime}}\left\lbrack {u\left( {{r\left( {s,a} \right)} + {{\gamma \cdot \max\limits_{a}}{Q^{*}\left( {s^{\prime},a} \right)}} - {Q^{*}\left( {s,a} \right)}} \right)} \right\rbrack} - x_{0}} = 0$

for all (s, a), where s′ is sampled from $\mathcal{P}\left\lbrack {\cdot \mid s,a} \right\rbrack$, and the corresponding policy π* of Q* satisfies J̃_(π*) ≥ J̃_(π) ∀π.
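
A minimal Python sketch of the risk-averse update behind Algorithm 5 follows. The exponential utility, the shift x₀=−1, and the TD error are as described above; the state and action indices and the learning rate are placeholders.

    import numpy as np

    def risk_averse_q_update(Q, s, a, r, s_next, alpha, beta=-0.5, gamma=0.99, x0=-1.0):
        """Apply the utility u(x) = -exp(beta*x), beta < 0, to the TD error before updating."""
        td_error = r + gamma * Q[s_next].max() - Q[s, a]
        u = -np.exp(beta * td_error)     # monotonically increasing and concave for beta < 0
        Q[s, a] += alpha * (u - x0)      # with x0 = -1, a zero TD error produces a zero update
        return Q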

Multi-Agent Q-Learning

[16] proposed Nash-Q, a multi-agent Q-learning algorithm (e.g., Algorithm 6 in FIG. 9, further elaborated in the Discussion section below) in the framework of general-sum stochastic games. When there exists a unique Nash equilibrium in the game, this algorithm is useful for finding the optimal strategy. Nash-Q assumes an agent can observe the other agent's immediate rewards and previous actions during learning. Each learning agent maintains two Q-tables, one for its own Q values, and one for the other agents' Q values. [16] showed that under strong assumptions (e.g., Assumption B.3 in the Discussion section below), Algorithm 6 converges to the Nash equilibrium.

Example Embodiments

In some example embodiments, a computer-implemented system for training an automated agent may include: a communication interface; at least one processor; memory in communication with the at least one processor; software code stored in the memory, which when executed at the at least one processor causes the system to: instantiate an automated agent that maintains a reinforcement learning neural network and generates, according to outputs of the reinforcement learning neural network, signals for communicating task requests; receive, by way of the communication interface, a plurality of training input data including a plurality of states and a plurality of actions for the automated agent; initialize a learning table Q for the automated agent based on the plurality of states and the plurality of actions; compute a plurality of updated learning tables based on the initialized learning table Q using a utility function, the utility function comprising a monotonically increasing concave function; and generate an averaged learning table Q′ based on the plurality of updated learning tables.

In some embodiments, the automated agent is configured to select an action based on the averaged learning table Q′ for communicating one or more task requests.

In some embodiments, the utility function is represented by u(x)=−e^(βx), β<0.

In some embodiments, computing a plurality of updated learning tables may include: receiving, by way of the communication interface, an input parameter k and a training step parameter T; and for each training step t, where t=1, 2 . . . T: computing an interim learning table Q̂ based on the initialized learning table Q; selecting an action a_(t) from the plurality of actions based on the interim learning table Q̂ and a given state s_(t) from the plurality of states; computing a reward r_(t) and a next state s_(t+1) based on the selected action a_(t); and for at least two values of i, where i=1, 2, . . . , k, computing a respective updated learning table Q^(i) of the plurality of updated learning tables based on (s_(t), a_(t), r_(t), s_(t+1)) and the utility function.

In some embodiments, the averaged learning table Q′ is computed as

$\frac{1}{k}{\sum_{i = 1}^{k}{Q^{i}.}}$

In some embodiments, the utility function is a first utility function and the software code, when executed at the at least one processor, further causes the system to: instantiate an adversarial agent that maintains an adversarial reinforcement learning neural network and generates, according to outputs of the adversarial reinforcement learning neural network, signals for communicating adversarial task requests; initialize an adversarial learning table Q_(A) for the adversarial agent; compute a plurality of updated adversarial learning tables based on the initialized adversarial learning table Q_(A) using a second utility function, the second utility function comprising a monotonically increasing convex function; and generate an averaged adversarial learning table Q_(A)′ based on the plurality of updated adversarial learning tables.

In some embodiments, the adversarial agent is configured to select an adversarial action based on the averaged adversarial learning table Q_(A)′ to minimize a reward for the automated agent.

In some embodiments, the second utility function is represented by u^(A)(x)=−e^(β^(A)x), β^(A)>0.

In some embodiments, computing a plurality of updated adversarial learning tables may include: receiving, by way of the communication interface, an input parameter k and a training step parameter T; and for each training step t, where t=1, 2 . . . T: computing an interim adversarial learning table Q̂_(A) based on the initialized adversarial learning table Q_(A); selecting an adversarial action a_(t)^(A) based on the interim adversarial learning table Q̂_(A) and a given state s_(t) from the plurality of states; computing an adversarial reward r_(t)^(A) and a next state s_(t+1) based on the selected adversarial action a_(t)^(A); and for at least two values of i, where i=1, 2, . . . , k, computing a respective updated adversarial learning table Q^(i)_(A) of the plurality of updated adversarial learning tables based on (s_(t), a_(t)^(A), r_(t)^(A), s_(t+1)) and the second utility function.

In some embodiments, the averaged adversarial learning table Q_(A)′ is computed as

$\frac{1}{k}{\sum_{i = 1}^{k}{Q_{A}^{i}.}}$

In some example embodiments, there is a computer-implemented method of training an automated agent, the method may include: instantiating an automated agent that maintains a reinforcement learning neural network and generates, according to outputs of the reinforcement learning neural network, signals for communicating task requests; receiving, by way of the communication interface, a plurality of training input data including a plurality of states and a plurality of actions for the automated agent; initializing a learning table Q for the automated agent based on the plurality of states and the plurality of actions; computing a plurality of updated learning tables based on the initialized learning table Q using a utility function, the utility function comprising a monotonically increasing concave function; and generating an averaged learning table Q′ based on the plurality of updated learning tables.

In some embodiments, the method may further include: selecting an action, by the automated agent, based on the averaged learning table Q′ for communicating one or more task requests.

In some embodiments, the utility function is represented by u(x)=−e^(βx), β<0.

In some embodiments, computing a plurality of updated learning tables may include: receiving, by way of the communication interface, an input parameter k and a training step parameter T; and for each training step t, where t=1, 2 . . . T: computing an interim learning table Q̂ based on the initialized learning table Q; selecting an action a_(t) from the plurality of actions based on the interim learning table Q̂ and a given state s_(t) from the plurality of states; computing a reward r_(t) and a next state s_(t+1) based on the selected action a_(t); and for at least two values of i, where i=1, 2, . . . , k, computing a respective updated learning table Q^(i) of the plurality of updated learning tables based on (s_(t), a_(t), r_(t), s_(t+1)) and the utility function.

In some embodiments, the averaged learning table Q′ is computed as

$\frac{1}{k}{\sum_{i = 1}^{k}{Q^{i}.}}$

In some embodiments, the method may further include: instantiating an adversarial agent that maintains an adversarial reinforcement learning neural network and generates, according to outputs of the adversarial reinforcement learning neural network, signals for communicating adversarial task requests; initializing an adversarial learning table Q_(A) for the adversarial agent; computing a plurality of updated adversarial learning tables based on the initialized adversarial learning table Q_(A) using a second utility function, the second utility function comprising a monotonically increasing convex function; and generating an averaged adversarial learning table Q_(A)′ based on the plurality of updated adversarial learning tables.

In some embodiments, the method may further include selecting an adversarial action by the adversarial agent based on the averaged adversarial learning table Q_(A)′ to minimize a reward for the automated agent.

In some embodiments, the second utility function is represented by u^(A)(x)=−e^(β^(A)x), β^(A)>0.

In some embodiments, computing a plurality of updated adversarial learning tables may include: receiving, by way of the communication interface, an input parameter k and a training step parameter T; and for each training step t, where t=1, 2 . . . T: computing an interim adversarial learning table Q̂_(A) based on the initialized adversarial learning table Q_(A); selecting an adversarial action a_(t)^(A) based on the interim adversarial learning table Q̂_(A) and a given state s_(t) from the plurality of states; computing an adversarial reward r_(t)^(A) and a next state s_(t+1) based on the selected adversarial action a_(t)^(A); and for at least two values of i, where i=1, 2, . . . , k, computing a respective updated adversarial learning table Q^(i)_(A) of the plurality of updated adversarial learning tables based on (s_(t), a_(t)^(A), r_(t)^(A), s_(t+1)) and the second utility function.

In some embodiments, the averaged adversarial learning table Q_(A)′ is computed as

$\frac{1}{k}{\sum_{i = 1}^{k}{Q_{A}^{i}.}}$

More specifically, a Risk-Averse Averaged Q-Learning (e.g., RA2-Q shown in FIG. 3) and a Variance Reduced Risk-Averse Q-Learning (e.g., RA2.1-Q shown in FIG. 4) use risk-averse utility functions and reduce variance by training multiple Q tables in parallel. Then, the training framework is further configured to simulate a multi-agent scenario in which an adversary can perturb the learning process, such as, for example, a Risk-Averse Multi-Agent Q-Learning (e.g., RAM-Q in FIG. 5), which is a multi-agent algorithm that assumes an adversary which can perturb the learning process. While RAM-Q is proven to have convergence guarantees, it also needs strong assumptions that might not hold in reality. A Risk-Averse Adversarial Averaged Q-Learning (e.g., RA3-Q in FIG. 6) relaxes those strong assumptions and proposes a more practical algorithm that keeps the multi-agent adversarial component to improve robustness.

Table 1 below briefly summarizes each of the four machine learning models and their respective convergence guarantees (or lack thereof).

TABLE 1: Summary of Four Machine Learning RL Models

Model: Risk-Averse Averaged Q-Learning (RA2-Q)
Brief Description: Q-Learning with a utility function; and a more stable choice of actions with multiple Q tables
Convergence: Convergence to optimal of a risk-averse objective function, and reduced training variance

Model: Variance Reduced Risk-Averse Q-Learning (RA2.1-Q)
Brief Description: Use average estimation of multiple tables in Q updates; and apply utility function in Q updates
Convergence: No convergence guarantee

Model: Risk-Averse Multiagent Q-learning (RAM-Q)
Brief Description: Multi-agent Nash Q-Learning with a utility function; a risk-averse protagonist agent and a risk-seeking adversary agent; and multiple Q tables
Convergence: Convergence similar to [16]

Model: Risk-Averse Adversarial Averaged Q-Learning (RA3-Q)
Brief Description: Multi-agent Q-Learning with a utility function; a risk-averse protagonist agent and a risk-seeking adversary agent; and multiple Q tables
Convergence: No convergence guarantee

Risk-Averse Averaged Q-Learning (RA2-Q)

FIG. 3 shows an example risk-averse averaged Q-learning algorithm (RA2-Q), in accordance with an embodiment. As shown, input training data may include: a number of training steps T; an exploration rate ϵ; a number of models k; a risk control parameter λ_(P); and a utility function parameter β. A Q-table for the automated agent is initialized, e.g., Q^(i)=0. Other values may be initialized as well: e.g., N^(i)=0, α^(i)=1 for all i=1, . . . , k. A replay buffer may also be set, RB=Ø, and an action-choosing head integer H ∈ [1, k] is randomly sampled.

From t=1 to T, for each value of t (“while in the t loop”): with Q=Q^(H), the training system may compute an interim Q-table Q̂ by

$\begin{matrix}{{\hat{Q}\left( {s,a} \right)} = {{Q\left( {s,a} \right)} - {\lambda_{p} \cdot \frac{\sum_{i = 1}^{k}\left( {{Q^{i}\left( {s,a} \right)} - {\overset{\_}{Q}\left( {s,a} \right)}} \right)^{2}}{k - 1}}}} & (8)\end{matrix}$

where λ_(P)>0 is a constant; and

${\overset{\_}{Q}\left( {s,a} \right)} = {\frac{1}{k}{\sum_{i = 1}^{k}{Q^{i}\left( {s,a} \right)}}}$

Next, while in the t loop, the training system may select action a_(t) according to Q̂ by applying an ϵ-greedy strategy, execute the action, and get (s_(t), a_(t), r_(t), s_(t+1)), which can be appended to the replay buffer: RB=RB ∪ {(s_(t), a_(t), r_(t), s_(t+1))}.
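
A small Python sketch of this step follows, assuming the k Q tables are NumPy arrays of identical shape and that H is the currently sampled action-choosing head; the ϵ-greedy choice itself is left out for brevity.

    import numpy as np

    def risk_averse_interim_table(Q_tables, head, lambda_p=1.0):
        """Eq. (8): penalize the selected head's table by the ensemble's sample variance."""
        Q_stack = np.stack(Q_tables)                    # shape (k, n_states, n_actions)
        Q_mean = Q_stack.mean(axis=0)                   # the averaged table Q-bar
        sample_var = ((Q_stack - Q_mean) ** 2).sum(axis=0) / (len(Q_tables) - 1)
        return Q_tables[head] - lambda_p * sample_var   # entries the ensemble disagrees on are discounted

    # Actions can then be chosen (epsilon-greedily) from the returned interim table, e.g.:
    # a_t = int(risk_averse_interim_table(Q_tables, head=H)[s_t].argmax())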

The training system may, while in the t loop, generate a mask M ∈ ℕ^(k)˜Poisson(1), and for i=1, . . . , k, for each value of i (“while in the i loop”):

if and when M_(i)=1, update the learning table Q^(i) by

$\begin{matrix}{{Q^{i}\left( {s_{t},a_{t}} \right)} = {{Q^{i}\left( {s_{t},a_{t}} \right)} + {{\alpha^{i}\left( {s_{t},a_{t}} \right)} \cdot \left\lbrack {{u\left( {{r\left( {s_{t},a_{t}} \right)} + {{\gamma \cdot \max\limits_{a}}{Q^{i}\left( {s_{t + 1},a} \right)}} - {Q^{i}\left( {s_{t},a_{t}} \right)}} \right)} - x_{0}} \right\rbrack}}} & (9)\end{matrix}$

where u is a utility function configured to minimize risks, and x₀=−1.

In some embodiments, the utility function u may be a monotonically increasing concave function in order to minimize risks (and maximize reward) for the automated agent. For example, one example utility function u(x) can be:

If x<=0, u(x)=0.5x; or

If x>0, u(x)=0.1x.

For another example, the utility function u may be u(x)=−e^(βx) where β<0.

Next, while in the i loop, the training system may update N^(i) by N^(i)(s_(t),a_(t))=N^(i) (s_(t),a_(t))+1; update learning rate

${\alpha^{i}\left( {s_{t},a_{t}} \right)} = {\frac{1}{N^{i}\left( {s_{t},a_{t}} \right)}.}$
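
The inner i loop just described can be sketched in Python as follows. The Poisson mask, the Eq. (9) update with the exponential utility, and the count-based learning rate mirror the steps above; the table shapes are assumptions carried over from the earlier sketches.

    import numpy as np

    def ra2q_inner_loop(Q_tables, counts, transition, beta=-0.5, gamma=0.99, x0=-1.0):
        s, a, r, s_next = transition
        mask = np.random.poisson(1.0, size=len(Q_tables))
        for i, Q in enumerate(Q_tables):
            if mask[i] != 1:                  # "if and when M_i = 1"
                continue
            n = counts[i][s, a]
            alpha = 1.0 / n if n > 0 else 1.0                     # learning rate 1 / N^i(s_t, a_t)
            td_error = r + gamma * Q[s_next].max() - Q[s, a]
            Q[s, a] += alpha * (-np.exp(beta * td_error) - x0)    # Eq. (9) with u(x) = -exp(beta*x)
            counts[i][s, a] += 1                                  # N^i(s_t, a_t) += 1
        return Q_tables, counts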

Outside of the i loop but still while in the t loop, the training system may update H by randomly sampling an integer from 1 to k.

Once outside of the t loop, the training system may generate the averaged Q-learning table

$Q^{\prime} = {\frac{1}{k}{\sum_{i = 1}^{k}{Q^{i}.}}}$

With Algorithm 5, even though convergence to the optimal of the risk-sensitive objective function is, in theory, guaranteed with probability 1, the proof assumes visiting every state infinitely many times, whereas the actual training time is finite. The RA2-Q algorithm above can reduce the training variance further by choosing more risk-averse actions during the finite training process.

The RA2-Q algorithm trains multiple Q tables in parallel and reduces training variance by averaging multiple Q tables in the update. Moreover, in order to obtain a convergence guarantee, k Q tables are trained and updated in parallel using Eq. (9) above as the update rule. To select more stable actions, the sample variance of the k Q tables can be used as an approximation to the true variance and then a risk-averse Q̂ table (e.g., an interim Q-table) can be computed. The risk-averse Q̂ table can then be used to select actions.

The objective function here is Eq. (7), and it can be shown that the RA2-Q algorithm (also known as Algorithm 1) also converges to the optimal.

Theorem 2 Running the RA2-Q algorithm from an initial Q table, then for all i ∈ {1, . . . , k}, Q^(i)→Q* w.p. 1, hence the returned table

${{\frac{1}{k}{\sum_{i = 1}^{k}Q^{i}}}\rightarrow{Q^{*}\;\text{w.p.}\;1}},$

where Q* is the unique solution to

${{{\mathbb{E}}_{s^{\prime}}\left\lbrack {u\left( {{r\left( {s,a} \right)} + {{\gamma \cdot \max\limits_{a}}{Q^{*}\left( {s^{\prime},a} \right)}} - {Q^{*}\left( {s,a} \right)}} \right)} \right\rbrack} - x_{0}} = 0$

for all (s, a), where s′ is sampled from $\mathcal{P}\left\lbrack {\cdot \mid s,a} \right\rbrack$, and the corresponding policy π* of Q* satisfies J̃_(π*) ≥ J̃_(π) ∀π.

Theorem 2 follows directly from Theorem 1 (e.g., see Discussion sectionbelow for details).

Variance Reduced Risk-Averse Q-Learning (RA2.1-Q)

FIG. 4 shows an example variance reduced risk-averse Q-learning algorithm (RA2.1-Q), in accordance with an embodiment. The input data include: training epochs T; exploration rate ϵ; number of models k; epoch length K; recentering sample size N; and utility function parameter β<0. The training system can initialize a number of values: Q₀=0; m=1.

From m=1 to T, for each value of m ("while in the m loop"): the training system selects an action according to Q̄_(m−1) by applying an ϵ-greedy strategy, executes the action and gets (s,a,r(s,a),s′), and updates the replay buffer RB=RB ∪ (s,a,r(s,a),s′).

While in the m loop, from i=1, . . . , N for each value of i ("while in the i loop"): the system defines the empirical Bellman operator $\ddot{\mathcal{J}}_{i}$ as

${{{\overset{¨}{\mathcal{J}}}_{i}(Q)}\left( {s,a} \right)} = {{u\left( {{r\left( {s,a} \right)} + {{\gamma \cdot \max\limits_{a^{\prime}}}{Q\left( {s_{i},a^{\prime}} \right)}}} \right)} - x_{0}}$

where s_(i) is randomly sampled from 𝒯[·|s,a]; u is the utility function, and x₀=−1.

In some embodiments, the utility function u may be a monotonically increasing concave function in order to minimize risks (and maximize reward) for the automated agent. For example, the utility function u(x) can be:

If x<=0, u(x)=0.5x; or

If x>0, u(x)=0.1x.

For another example, the utility function u may be u(x)=−e^(βx) where β<0.

Once outside of the i loop, the system defines

${{{\overset{\sim}{\mathcal{J}}}_{N}\left( {\overset{\_}{Q}}_{m - 1} \right)} = {\frac{1}{N}{\sum_{i \in \mathcal{D}_{N}}{{\overset{¨}{\mathcal{J}}}_{i}\left( {\overset{\_}{Q}}_{m - 1} \right)}}}},$

where $\mathcal{D}_{N}$ is a collection of N i.i.d. samples (i.e., matrices with samples for each state-action pair (s,a) from RB). Define Q₁=Q̄_(m−1).

From k=1, . . . , K for each value of k ("while in the k loop"): the system computes the stepsize

$\lambda_{k} = \frac{1}{1 + {\left( {1 - \gamma} \right)k}}$

and

$\begin{matrix}{{Q_{k + 1}} = {{\left( {1 - \lambda_{k}} \right) \cdot Q_{k}} + {\lambda_{k} \cdot \left\lbrack {{{\overset{¨}{\mathcal{J}}}_{k}\left( Q_{k} \right)} - {{\overset{¨}{\mathcal{J}}}_{k}\left( {\overset{\_}{Q}}_{m - 1} \right)} + {{\overset{\sim}{\mathcal{J}}}_{N}\left( {\overset{\_}{Q}}_{m - 1} \right)}} \right\rbrack}}} & (10)\end{matrix}$

where ${\overset{¨}{\mathcal{J}}}_{k}$ is an empirical Bellman operator constructed using a sample not in $\mathcal{D}_{N}$, thus the random operators ${\overset{¨}{\mathcal{J}}}_{k}$ and ${\overset{\sim}{\mathcal{J}}}_{N}$ are independent.

Once outside of the k loop: Q̄_(m)=Q_(K+1); m=m+1. Then once outside of the m loop, the system returns Q̄_(m).
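
A sketch of one outer epoch of this recentered update is given below. It is illustrative only: `draw_next_states` is a hypothetical helper that returns, for every state-action pair, a next state sampled from the replay buffer; the reward table `rewards` is assumed known per (s, a); and the utility is the exponential example. Note that the same empirical sample is applied to both Q_(k) and Q̄_(m−1) inside the k loop, which is what makes the recentering act as a variance-reducing control variate.

    import numpy as np

    def empirical_bellman(Q, next_states, rewards, gamma, u, x0):
        # One-sample empirical Bellman operator:
        # J(Q)(s, a) = u(r(s, a) + gamma * max_a' Q(s_i, a')) - x0
        out = np.empty_like(Q)
        for s in range(Q.shape[0]):
            for a in range(Q.shape[1]):
                out[s, a] = u(rewards[s, a] + gamma * Q[next_states[s, a]].max()) - x0
        return out

    def ra21q_epoch(Q_bar_prev, rewards, draw_next_states, gamma=0.99, N=32, K=50,
                    beta=-0.5, x0=-1.0, rng=None):
        rng = rng if rng is not None else np.random.default_rng(0)
        u = lambda x: -np.exp(beta * x)
        # Recentering term: average of N independent empirical operators at Q_bar_{m-1}.
        J_N = np.mean([empirical_bellman(Q_bar_prev, draw_next_states(rng), rewards, gamma, u, x0)
                       for _ in range(N)], axis=0)
        Q = Q_bar_prev.copy()
        for k in range(1, K + 1):
            lam = 1.0 / (1.0 + (1.0 - gamma) * k)      # stepsize lambda_k
            sample = draw_next_states(rng)              # fresh sample, not in D_N
            Q = (1 - lam) * Q + lam * (
                empirical_bellman(Q, sample, rewards, gamma, u, x0)
                - empirical_bellman(Q_bar_prev, sample, rewards, gamma, u, x0)
                + J_N)                                  # Eq. (10)
        return Q                                        # becomes Q_bar_m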

[34] proposed Variance Reduced Q-learning, which trains multiple Q tables in parallel and uses the averaged Q table in the update rule. It is shown that it guarantees a convergence rate which is minimax optimal. The RA2.1-Q algorithm improves upon [34] by applying a utility function to the TD error during Q updates for the purpose of further reducing variance. To select more stable actions during training, the sample variance of k Q tables is used as an approximation to the true variance and a risk-averse {circumflex over (Q)} table is computed. The risk-averse {circumflex over (Q)} table can be used to select actions.

Multi-Agent Risk-Averse Q-Learning (RAM-Q)

FIG. 5 shows an example risk-averse multi-agent Q-learning (RAM-Q) algorithm, in accordance with an embodiment. The input data may include: training steps T; exploration rate ϵ; number of models k; utility function parameters β^(P)<0; β^(A)>0.

For ∀(s,a_(P),a_(A)), the system can initialize Q^(P)(s,a_(P),a_(A))=0;Q^(A)(s,a_(P),a_(A))=0; N(s,a_(P),a_(A))=0.

From t=1 to T, for each value of t ("while in the t loop"): at state s_(t), the system computes π^(P)(s_(t)), π^(A)(s_(t)), which form a mixed-strategy Nash equilibrium solution of the bimatrix game (Q^(P)(s_(t)), Q^(A)(s_(t))). The system (or the automated agent) selects an action a_(t)^(P) based on π^(P)(s_(t)) according to an ϵ-greedy strategy and selects an adversarial action a_(t)^(A) based on π^(A)(s_(t)) according to an ϵ-greedy strategy. The system observes and computes r_(t)^(P), r_(t)^(A) and s_(t+1).

While in the t loop, at state s_(t+1), the system computes π^(P)(s_(t+1)), π^(A)(s_(t+1)), which are mixed-strategy Nash equilibrium solutions of the bimatrix game (Q^(P)(s_(t+1)), Q^(A)(s_(t+1))), and updates N(s_(t),a_(t)^(P),a_(t)^(A))=N(s_(t),a_(t)^(P),a_(t)^(A))+1. The system sets the learning rate

$\alpha_{t} = {\frac{1}{N\left( {s_{t},a_{t}^{P},a_{t}^{A}} \right)}.}$

The system then updates Q^(P), Q^(A) such that:

Q ^(P)(s _(t) ,a _(t) ^(P) ,a _(t) ^(A))=Q ^(P)(s _(t) ,a _(t) ^(P) ,a_(t) ^(A))+α_(t)·[u ^(P)(r _(t) ^(P)+γ·π^(P)(s _(t+1))Q ^(P)(s_(t+1))π^(A)(s _(t+1))−Q ^(P)(s _(t) ,a _(t) ^(P) ,a _(t) ^(A)))−x₀]  (11)

where u^(P) is a utility function and x₀=−1.

Q ^(A)(s _(t) ,a _(t) ^(P) ,a _(t) ^(A))=Q ^(A)(s _(t) ,a _(t) ^(P) ,a_(t) ^(A))+α_(t)·[u ^(A)(r _(t) ^(A)+γ·π^(P)(s _(t+1))Q ^(A)(s_(t+1))π^(A)(s _(t+1))−Q ^(A)(s _(t) ,a _(t) ^(P) ,a _(t) ^(A)))−x₁]  (12)

where u^(A) is a utility function, x₁=1.

Outside of the t loop, the system then returns (Q^(P), Q^(A)).

In some embodiments, the utility function u^(P) may be a monotonically increasing concave function in order to minimize risks (and maximize reward) for the automated agent. For example, the utility function u^(P)(x) can be:

If x<=0, u^(P)(x)=0.5x; or

If x>0, u^(P)(x)=0.1x.

For another example, the utility function u^(P) may be u^(P)(x)=−e^(β^(P)·x) where β^(P)<0.

In some embodiments, the utility function u^(A) may be a monotonically increasing convex function in order to maximize risks (and minimize reward). For example, the utility function may be u^(A)(x)=e^(β^(A)·x) where β^(A)>0.
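
The two update rules can be sketched as below. The bimatrix Nash solver is outside the scope of the update itself, so `solve_bimatrix_nash` is a hypothetical placeholder returning the mixed strategies (π^(P)(s′), π^(A)(s′)); everything else follows Eqs. (11) and (12) with the exponential utilities.

    import numpy as np

    def ram_q_update(QP, QA, N, s, s_next, aP, aA, rP, rA, solve_bimatrix_nash,
                     gamma=0.99, betaP=-0.5, betaA=0.5, x0=-1.0, x1=1.0):
        # QP, QA: arrays of shape (n_states, n_actions_P, n_actions_A).
        uP = lambda x: -np.exp(betaP * x)    # concave utility for the protagonist
        uA = lambda x: np.exp(betaA * x)     # convex utility for the adversary

        # Mixed-strategy Nash equilibrium of the bimatrix game at the next state.
        piP, piA = solve_bimatrix_nash(QP[s_next], QA[s_next])
        vP = piP @ QP[s_next] @ piA          # pi^P(s') Q^P(s') pi^A(s')
        vA = piP @ QA[s_next] @ piA          # pi^P(s') Q^A(s') pi^A(s')

        N[s, aP, aA] += 1
        alpha = 1.0 / N[s, aP, aA]
        QP[s, aP, aA] += alpha * (uP(rP + gamma * vP - QP[s, aP, aA]) - x0)  # Eq. (11)
        QA[s, aP, aA] += alpha * (uA(rA + gamma * vA - QA[s, aP, aA]) - x1)  # Eq. (12)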

In complex scenarios such as financial markets, learned RL policies can be brittle. To improve robustness, adversarial learning is incorporated into a multi-agent learning problem in the RAM-Q algorithm.

In the adversarial setting, it is assumed that there are two learning processes happening simultaneously, a main protagonist (P) and an adversary (A): the goal of the protagonist is to maximize the total return as well as minimize the variance; the goal of the adversary is to minimize the total return of the protagonist as well as maximize the variance. Here, one assumption is that each agent can observe its opponent's immediate reward.

Let r_(t) ^(p) be the immediate reward received by protagonist at stept, and let r_(t) ^(A) be the immediate reward received by adversary atstep t. Then the objective functions may be chosen as follows:

The objective function for the protagonist is,

$\begin{matrix}{{\overset{\sim}{J}}_{\pi}^{P} = {{\frac{1}{\beta^{P}}{{\mathbb{E}}_{\pi}\left\lbrack {\exp\left( {\beta^{P}{\sum_{t = 0}^{\infty}{\gamma^{t} \cdot r_{t}^{P}}}} \right)} \right\rbrack}\beta^{P}} < 0}} & (13)\end{matrix}$

By a Taylor expansion, Eq. (13) yields:

${\overset{\sim}{J}}_{\pi}^{P} = {{{\mathbb{E}}\left\lbrack {\sum_{t = 0}{\gamma^{t} \cdot r_{t}^{P}}} \right\rbrack} + {\frac{\beta^{P}}{2}{{\mathbb{V}ar}\left\lbrack {\sum_{t = 0}{\gamma^{t} \cdot r_{t}^{P}}} \right\rbrack}} + {{O\left( \left( \beta^{P} \right)^{2} \right)}.}}$

Similarly, the objective function for the adversary is,

$\begin{matrix}{{\overset{\sim}{J}}_{\pi}^{A} = {{\frac{1}{\beta^{A}}{{\mathbb{E}}_{\pi}\left\lbrack {\exp\left( {\beta^{A}{\sum_{t = 0}^{\infty}{\gamma^{t}r_{t}^{A}}}} \right)} \right\rbrack}\beta^{A}} > 0}} & (14)\end{matrix}$

and by Taylor expansion, Eq. (14) yields,

${\overset{\sim}{J}}_{\pi}^{A} = {{{\mathbb{E}}\left\lbrack {\sum_{t = 0}{\gamma^{t} \cdot r_{t}^{A}}} \right\rbrack} + {\frac{\beta^{A}}{2}{{\mathbb{V}ar}\left\lbrack {\sum_{t = 0}{\gamma^{t} \cdot r_{t}^{A}}} \right\rbrack}} + {{O\left( \left( \beta^{A} \right)^{2} \right)}.}}$
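
For completeness, the expansion used in Eqs. (13) and (14) can be obtained as follows, reading the exponential objective as the certainty equivalent (1/β)·log 𝔼[exp(β·G)] of the discounted return G=Σ_(t) γ^(t)·r_(t) (the log form is the one whose second-order expansion produces the mean-variance trade-off; cf. the certainty-equivalent identity recalled from [28] later in this disclosure):

$\frac{1}{\beta}\log{\mathbb{E}}\left\lbrack e^{\beta G} \right\rbrack = \frac{1}{\beta}\log\left( 1 + \beta{\mathbb{E}}\lbrack G\rbrack + \frac{\beta^{2}}{2}{\mathbb{E}}\lbrack G^{2}\rbrack + O\left( \beta^{3} \right) \right) = {\mathbb{E}}\lbrack G\rbrack + \frac{\beta}{2}\left( {{\mathbb{E}}\lbrack G^{2}\rbrack - {\mathbb{E}}\lbrack G\rbrack^{2}} \right) + O\left( \beta^{2} \right) = {\mathbb{E}}\lbrack G\rbrack + \frac{\beta}{2}{\mathbb{V}ar}\lbrack G\rbrack + O\left( \beta^{2} \right),$

using log(1+x)=x−x²/2+O(x³). A negative β therefore penalizes the variance of the return (protagonist), while a positive β rewards it (adversary).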

Then the following guarantee holds:

Theorem 3 If the two-agent game ({tilde over (J)}^(P),{tilde over (J)}^(A)) has a Nash equilibrium solution, then running the RAM-Q algorithm from initial Q tables Q^(P), Q^(A) will converge to Q_(P)* and Q_(A)* w.p. 1, s.t. the Nash equilibrium solution (π_(*)^(P), π_(*)^(A)) for the bimatrix game (Q_(P)*, Q_(A)*) is the Nash equilibrium solution to the game ({tilde over (J)}_(π)^(P),{tilde over (J)}_(π)^(A)), and the equilibrium payoffs are {tilde over (J)}^(P)(s,π_(*)^(P),π_(*)^(A)), {tilde over (J)}^(A)(s,π_(*)^(P),π_(*)^(A)).

Although the RAM-Q algorithm gives a solid convergence guarantee, it suffers from drawbacks like expensive computational cost and idealized assumptions; e.g., in trading markets, there may not exist a Nash equilibrium to ({tilde over (J)}^(P),{tilde over (J)}^(A)), and during the training process, assumptions about the Nash equilibrium (e.g., Assumption B.3 in the Discussion section below) may break [8]. Hence, another algorithm, RA3-Q, is developed, which relaxes these assumptions (likely at the expense of losing theoretical guarantees) while enhancing robustness and performing well in practice.

Risk-Averse Adversarial Averaged Q-Learning (RA3-Q)

FIG. 6 shows an example risk-averse adversarial averaged Q-learning (RA3-Q) algorithm, in accordance with an embodiment. The input data include: training steps T; exploration rate ϵ; number of models k; risk control parameters λ_(P), λ_(A); and utility function parameters β^(P)<0; β^(A)>0.

The training system can initialize Q_(P)^(i), Q_(A)^(i) ∀i=1, . . . , k, and the visit-count table N=0. The system then randomly samples action-choosing head integers H_(P), H_(A) ∈ {1, . . . , k}.

From t=1 to T, for each value of t ("while in the t loop"): the system sets Q_(P)=Q_(P)^(H_(P)), then computes {circumflex over (Q)}_(P), the risk-averse protagonist Q table, from the k Q tables Q_(P)^(i), i=1, . . . , k. The system also sets Q_(A)=Q_(A)^(H_(A)), then computes {circumflex over (Q)}_(A), the risk-seeking adversary Q table, from the k Q tables Q_(A)^(i), i=1, . . . , k.

Next, the system selects actions a_(P), a_(A) according to {circumflex over (Q)}_(P), {circumflex over (Q)}_(A) by applying an ϵ-greedy strategy and generates a Poisson mask M ∈ ℕ^(k) with entries drawn from Poisson(1). The system updates Q_(P)^(i), Q_(A)^(i), i=1, . . . , k according to mask M using update rules Eq. (11) and Eq. (12). The system then updates H_(P) and H_(A).

Once outside of the t loop, the system returns

$\frac{1}{k}{\sum_{i = 1}^{k}Q_{P}^{i}}$ and $\frac{1}{k}{\sum_{i = 1}^{k}{Q_{A}^{i}}}$.

In RA3-Q, the objective function for the protagonist agent is Eq. (13), and the objective function for the adversary agent is Eq. (14). In order to optimize {tilde over (J)}^(P) and {tilde over (J)}^(A), utility functions are applied to TD errors when updating Q tables, and training multiple Q tables in parallel is used to select actions with low variance. The full version of RA3-Q is Algorithm 7 in FIGS. 10A and 10B.

RA3-Q combines (i) risk aversion via utility functions, (ii) variance reduction by maintaining multiple Q tables, and (iii) robustness via adversarial learning. Intuitively, as the adversary is getting stronger, the protagonist experiences harder challenges, thus enhancing robustness. Compared to RAM-Q, where the returned policy (π^(P),π^(A)) is a Nash equilibrium of ({tilde over (J)}^(P),{tilde over (J)}^(A)), RA3-Q does not have a convergence guarantee; however, it has several practical advantages including computational efficiency, simplicity (e.g., no strong assumptions) and more stable actions during training. For a longer discussion see the Discussion section below.
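
A schematic of one RA3-Q training step is sketched below. It is a sketch under stated assumptions: the two-agent environment interface (env.step(aP, aA)) and the joint argmax selection are assumptions of this illustration, while the risk-adjustment of {circumflex over (Q)}_(P)/{circumflex over (Q)}_(A) and the masked per-table updates follow Eqs. (77)-(82) of the detailed Algorithm 7 below.

    import numpy as np

    def ra3q_step(QP, QA, N, HP, HA, env, s, rng, gamma=0.99, eps=0.1,
                  lamP=1.0, lamA=1.0, betaP=-0.5, betaA=0.5, x0=-1.0, x1=1.0):
        # QP, QA: arrays of shape (k, n_states, nP, nA); N: joint visit counts.
        k = QP.shape[0]
        uP = lambda x: -np.exp(betaP * x)
        uA = lambda x: np.exp(betaA * x)

        # Risk-averse protagonist table and risk-seeking adversary table (Eqs. (77)-(78)).
        qP_hat = QP[HP, s] - lamP * QP[:, s].var(axis=0, ddof=1)
        qA_hat = QA[HA, s] + lamA * QA[:, s].var(axis=0, ddof=1)

        nP, nA = qP_hat.shape
        aP = rng.integers(nP) if rng.random() < eps else int(np.unravel_index(qP_hat.argmax(), qP_hat.shape)[0])
        aA = rng.integers(nA) if rng.random() < eps else int(np.unravel_index(qA_hat.argmax(), qA_hat.shape)[1])

        s_next, rP, rA, done = env.step(aP, aA)     # assumed two-agent interface

        N[s, aP, aA] += 1
        alpha = 1.0 / N[s, aP, aA]
        mask = rng.poisson(1.0, size=k)             # Poisson(1) mask
        for i in range(k):
            if mask[i] == 1:
                tdP = rP + gamma * QP[i, s_next].max() - QP[i, s, aP, aA]
                tdA = rA + gamma * QA[i, s_next].max() - QA[i, s, aP, aA]
                QP[i, s, aP, aA] += alpha * (uP(tdP) - x0)   # cf. Eq. (81)
                QA[i, s, aP, aA] += alpha * (uA(tdA) - x1)   # cf. Eq. (82)
        return s_next, done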

Running RA3-Q from an initial Q table,

${{\frac{1}{k}{\sum_{i = 1}^{k}Q_{P}^{i}}}\rightarrow{Q^{P*}{w.p}\text{.1}}},$

where Q^(P)* is a solution to

${{\mathbb{E}}_{s_{t},a_{P},a_{A}}\left\lbrack {u^{P}\left( {r_{t}^{P} + {{\gamma \cdot \max\limits_{a}}{Q^{P*}\left( {s_{t + 1},a_{P},a_{A}} \right)}} - {Q^{P*}\left( {s_{t},a_{P},a_{A}} \right)}} \right)} \right\rbrack} = x_{0}$

∀(s_(t),a_(P),a_(A)), where s_(t+1) is sampled from 𝒫[·|s_(t),a_(P),a_(A)], and the corresponding policy π*_(P) of Q^(P)* satisfies {tilde over (J)}_(π*_(P)) ≥ {tilde over (J)}_(π) ∀π. In addition,

${{\frac{1}{k}{\sum_{i = 1}^{k}Q_{A}^{i}}}\rightarrow{Q^{A*}{w.p}\text{.1}}},$

where Q^(A)* is a solution to

${{\mathbb{E}}_{s_{t},a_{P},a_{A}}\left\lbrack {u^{A}\left( {r_{t}^{A} + {{\gamma \cdot \max\limits_{a}}{Q^{A*}\left( {s_{t + 1},a_{P},a_{A}} \right)}} - {Q^{A*}\left( {s_{t},a_{P},a_{A}} \right)}} \right)} \right\rbrack} = x_{1}$

∀(s_(t),a_(P),a_(A)), where s_(t+1) is sampled from 𝒫[·|s_(t),a_(P),a_(A)], and the corresponding policy π*_(A) of Q^(A)* satisfies {tilde over (J)}_(π*_(A))^(A) ≥ {tilde over (J)}_(π)^(A) ∀π.

Performance Evaluated by Empirical Game Theory

When the environment is populated by many learning agents, empirical game theory (EGT) may be used to evaluate the performance of the agents.

In EGT, each agent is a player involved in rounds of strategic interaction (games). By meta-game analysis, the superiority of each strategy can be evaluated. The contribution here is to theoretically prove that the Nash equilibrium of a risk-averse meta-game is an approximation of the Nash equilibrium of the population game; to the inventors' knowledge, this is the first work doing this type of risk-averse analysis.

In EGT, the dominance of strategies can be visualized by plotting the meta-game payoff tables together with the replicator dynamics. A meta-game payoff table could be seen as a combination of two matrices (U|N), where each row N_(i) contains a discrete distribution of p players over k strategies, and each row yields a discrete profile (n_(π₁), . . . , n_(π_(k))) indicating exactly how many players play each strategy, with Σ_(j) n_(π_(j)) = p. A strategy profile is then

$u = {\left( {\frac{n_{\pi_{1}}}{p},\ldots,\ \frac{n_{\pi_{k}}}{p}} \right).}$

And each row R_(i) captures the rewards corresponding to the rows in N.

For example, for a game A with 2 players, and 3 strategies {π₁, π₂, π₃}to choose from, the meta game payoff table could be constructed asfollows: in the left side of the table, all of the possible combinationsof strategies are listed. If there are p players and k strategies, thenthere are

$\begin{pmatrix}{p + k - 1} \\p\end{pmatrix}$

rows, hence in game A, there are 6 rows.

Once a meta-game payoff table and the replicator dynamics are obtained, a directional field plot is computed where arrows in the strategy space indicate the direction of flow, or change, of the population composition over the strategies (see the Discussion section below for two examples of directional field plots in multi-agent problems).
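
A minimal sketch of how such a directional field can be evaluated from a payoff matrix using single-population replicator dynamics is shown below (the 3×3 matrix here is the standard rock-paper-scissors payoff used only as a stand-in, not the experimental meta-payoffs):

    import numpy as np

    def replicator_derivative(x, A):
        # Single-population replicator dynamics on the strategy simplex:
        # dx_i/dt = x_i * ((A x)_i - x^T A x)
        fitness = A @ x
        return x * (fitness - x @ fitness)

    A = np.array([[ 0.0, -1.0,  1.0],
                  [ 1.0,  0.0, -1.0],
                  [-1.0,  1.0,  0.0]])        # rock-paper-scissors payoffs
    # Evaluate the flow on a coarse grid over the simplex (the arrows of the plot).
    for p in np.linspace(0.1, 0.8, 4):
        for q in np.linspace(0.1, 0.9 - p, 4):
            x = np.array([p, q, 1.0 - p - q])
            print(np.round(x, 2), "->", np.round(replicator_derivative(x, A), 3))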

Previously, [33] showed that for a game r^(i)(π¹, . . . , π^(p)) = 𝔼[R^(i)(π¹, . . . , π^(p))], with a meta-payoff (empirical payoff) {circumflex over (r)}^(i)(π¹, . . . , π^(p)), the Nash Equilibrium of {circumflex over (r)} is an approximation of the Nash Equilibrium of r.

Lemma 1 [33] If x is a Nash Equilibrium for the game {circumflex over (r)}^(i)(π¹, . . . , π^(p)), then it is a 2ϵ-Nash equilibrium for the game r^(i)(π¹, . . . , π^(p)), where

$\epsilon = {\sup\limits_{\pi,i}{\left| {{{\overset{\hat{}}{r}}^{i}(\pi)} - {r^{i}(\pi)}} \right|}}.$

Lemma 1 implies that if for each player, we can bound the estimationerror of empirical payoff, then we can use the Nash Equilibrium of metagame as an approximation of Nash Equilibrium of the game.

As the objective is to consider a risk-averse payoff to evaluate strategies, instead of

r^(i)(π¹, . . . , π^(p)) = 𝔼[R^(i)(π¹, . . . , π^(p))],

the following equation

h^(i)(π¹, . . . , π^(p)) = 𝔼[R^(i)(π¹, . . . , π^(p))] − β·𝕍ar[R^(i)(π¹, . . . , π^(p))]

(where β>0) is chosen as the game payoff.

Moreover, the following equation

$\begin{matrix}{{{\overset{\hat{}}{h}}^{i}\left( {\pi^{i},\ldots,\pi^{p}} \right)} = {{\overset{¯}{R}}^{i} - {\beta \cdot \left\lbrack {\frac{1}{n - 1}{\sum\limits_{j = 1}^{n}\left( {R_{j}^{i} - {\overset{¯}{R}}^{i}} \right)^{2}}} \right\rbrack}}} & (15)\end{matrix}$

is chosen as meta-game payoff, where

${\overset{¯}{R}}^{i} = {\frac{1}{n}{\sum\limits_{j = 1}^{n}R_{j}^{i}}}$

and R_(j) ^(i) is the stochastic payoff of player i in j-th experiment.
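
For concreteness, the estimator in Eq. (15) can be sketched as follows (the payoff samples below are placeholder draws; in the experiments they would be, e.g., Sharpe ratios from repeated simulations):

    import numpy as np

    def risk_averse_meta_payoff(payoff_samples, beta=0.5):
        # Eq. (15): empirical mean payoff minus beta times the unbiased sample
        # variance, per player, over n repeated experiments.
        R = np.asarray(payoff_samples, dtype=float)   # shape (n_experiments, n_players)
        return R.mean(axis=0) - beta * R.var(axis=0, ddof=1)

    rng = np.random.default_rng(0)
    samples = rng.normal(loc=[0.9, 0.7], scale=[0.2, 0.5], size=(80, 2))  # placeholder payoffs
    print(risk_averse_meta_payoff(samples, beta=0.5))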

To the inventors' knowledge, there is no previous work on empirical gametheory analysis with risk sensitive payoff. Below a theoretical analysisis presented to show that for the risk-averse payoff game, the NashEquilibrium can still be approximated by a meta game.

Theorem 4 Under Assumption G.4, for a Normal Form Game with p players, where each player i chooses a strategy π^(i) from a set of strategies S^(i)={π₁^(i), . . . , π_(k)^(i)} and receives a risk-averse payoff h^(i)(π¹, . . . , π^(p)), with meta-game payoff ĥ^(i) given by Eq. (15): if x is a Nash Equilibrium for the game ĥ^(i)(π¹, . . . , π^(p)), then it is a 2ϵ-Nash equilibrium for the game h^(i)(π¹, . . . , π^(p)) with probability 1−δ if the game is played n times, where

$\begin{matrix}{n \geq {\max\left\{ {{{- \frac{8R^{2}}{\in^{2}}}{\log\left\lbrack {\frac{1}{4}\left( {1 - \left( {1 - \delta} \right)^{\frac{1}{{❘s^{1}❘} \times \ldots \times {❘s^{p}❘} \times p}}} \right)} \right\rbrack}},\frac{64\beta^{2}{\omega^{2} \cdot {\Gamma(2)}}}{\in^{2}\left\lbrack {1 - \left( {1 - \delta} \right)^{\frac{1}{|S^{1}|{\times \ldots \times}|{Sp}|{\times p}}}} \right\rbrack}}\} \right.}} & (16)\end{matrix}$

Experiments

The experiments are conducted using the open-sourced ABIDES [11] market simulator in a simplified setting. The environment is generated by replaying publicly available real trading data for a single stock ticker (see e.g., https://lobsterdata.com/info/DataSamples.php). The setting includes one non-learning agent that replays the market deterministically [3] and learning agents. The learning agents considered are: RAQL (i.e., Algorithm 5), RA2-Q (i.e., Algorithm 1), RA2.1-Q (i.e., Algorithm 2), and RA3-Q (i.e., Algorithm 4).

A setting similar to existing implementations in ABIDES (see e.g., https://github.com/abides-sim/abides/blob/master/agent/examples/QLearningAgent.py) is used, where the state space is defined by two features: current holdings and volume imbalance. Agents take one action at every time step (every second), selecting among buy/sell with limit price base+i·K, where i ∈ {1, 2, . . . , 6}, or doing nothing. The immediate reward is defined by the change in the value of the portfolio (mark-to-market) compared against the previous time step. The comparisons are in terms of Sharpe ratio, which is a widely used measure in trading markets.
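
For concreteness, sketches of the two quantities just described, the per-step mark-to-market reward and the Sharpe ratio used for comparison, are given below (the data layout and the absence of any annualization factor are assumptions of this illustration):

    import numpy as np

    def mark_to_market_reward(cash, holdings, price, prev_cash, prev_holdings, prev_price):
        # Immediate reward: change in portfolio value between consecutive time steps.
        return (cash + holdings * price) - (prev_cash + prev_holdings * prev_price)

    def sharpe_ratio(step_returns, risk_free=0.0, eps=1e-12):
        # Mean excess return divided by the standard deviation of returns.
        r = np.asarray(step_returns, dtype=float) - risk_free
        return r.mean() / (r.std(ddof=1) + eps)

    rng = np.random.default_rng(0)
    print(sharpe_ratio(rng.normal(0.001, 0.01, size=1000)))   # placeholder return series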

Table 2 below shows the meta-payoff table of a two-player game among three strategies: RAQL, RA2-Q and RA2.1-Q. The results show that the two proposed algorithms RA2-Q and RA2.1-Q obtained better results than RAQL. With those payoffs, the directional plot 700 and the trajectory plot 710 shown in FIG. 7 are generated, where black solid circles denote globally-stable equilibria, and the white circles denote unstable equilibria (saddle points). In the directional plot 700, the plot is colored according to the speed at which the strategy mix is changing at each point; and in the trajectory plot 710, the lines show trajectories for some points over the simplex.

TABLE 2 Meta-payoff of 2 players, 3 strategies, respectively RAQL, RA2-Q and RA2.1-Q, over 80 simulations. The return used here is the Sharpe Ratio.

N_(i1)  N_(i2)  N_(i3)  R_(i1)  R_(i2)  R_(i3)
2       0       0       0.9130  0       0
1       1       0       0.7311  0.7970  0
0       2       0       0       1.0298  0
1       0       1       0.6791  0       1.0786
0       0       2       0       0       2.2177
0       1       1       0       0.7766  1.4386

RA2-Q and RA3-Q are also compared in terms of robustness. In this setting both agents are trained under the same conditions as a first step. Then, in the testing phase, two types of perturbations, an adversarial agent (trained within RA3-Q) and a noise agent (i.e., zero-intelligence), are added in the environment. The results are presented in Table 3 below, in terms of Sharpe ratio using cross validation with 80 experiments.

Table 3 shows the comparison, again in terms of Sharpe ratio, with two types of perturbations: the trained adversary from RA3-Q used at testing time, and zero-intelligence agents. It can be seen that RA3-Q obtains better results in both cases due to its enhanced robustness.

TABLE 3

Algorithm/Setting   Adversarial Perturbation   ZI Agents Perturbation
RA2-Q               0.5269                     0.9538
RA3-Q               0.9347                     1.0692

Discussion

As mentioned earlier, RA2.1-Q in theory does not have a convergence guarantee; however, it obtained good empirical results (better than RAQL and RA2-Q). It is an open question whether RA2.1-Q converges to the optimum of Eq. (7). Furthermore, it may be explored whether RA2.1-Q also enjoys a minimax-optimal convergence rate up to a logarithmic factor as in [34]. Similarly, RA3-Q does not have a convergence guarantee in the multi-agent learning scenario (when protagonist and adversary agents are learning simultaneously). However, RA3-Q obtained better empirical results than RA2-Q, highlighting its robustness. In the text below, it is shown that Eq. (81) or Eq. (82) converges to the optimum assuming the policy for the adversary (or protagonist) is fixed (thus, it is no longer a multi-agent learning setting).

In terms of the EGT analysis, the analysis uses a risk-averse measurebased on variance (second moment), studying higher moments and othermeasures may be possible.

In this disclosure, four new Q-learning algorithms are presented that augment reinforcement learning agents with risk-awareness, variance reduction, and robustness. RA2-Q and RA2.1-Q are risk-averse but use slightly different techniques to reduce variance. RAM-Q and RA3-Q are two algorithms that extend the RL agents by adding an adversarial learning layer, which is expected to improve robustness. The theoretical results show convergence results for RA2-Q and RAM-Q; and in the empirical results, RA2.1-Q and RA3-Q obtained better results in a simplified trading scenario.

Risk-Averse Q-Learning (RAQL) and Proof of Convergence

FIG. 8 shows an example risk-averse Q-learning algorithm (RAQL [28]; Algorithm 5). In particular, the Q-table is updated by Eq. (17) below:

$\begin{matrix}{{Q_{t + 1}\left( {s_{t},a_{t}} \right)} = {{Q_{t}\left( {s_{t},a_{t}} \right)} + {{\alpha_{t}\left( {s_{t},a_{t}} \right)} \cdot \left\lbrack {{u\left( {r_{t} + {{\gamma \cdot \max\limits_{a}}{Q_{t}\left( {s_{t + 1},a} \right)}} - {Q_{t}\left( {s_{t},a_{t}} \right)}} \right)} - x_{0}} \right\rbrack}}} & (17)\end{matrix}$

where u is a utility function, u(x)=−e^(βx) where β<0; x₀=−1.

As proven by [28] (Lemma A.2), for the iterative procedure

$\begin{matrix}{{Q_{t + 1}\left( {s_{t},a_{t}} \right)} = {{Q_{t}\left( {s_{t},a_{t}} \right)} + {{\alpha_{t}\left( {s_{t},a_{t}} \right)}\left\lbrack {{u\left( {r_{t} + {{\gamma \cdot \max\limits_{a}}{Q_{t}\left( {s_{t + 1},a} \right)}} - {Q_{t}\left( {s_{t},a_{t}} \right)}} \right)} - x_{0}} \right\rbrack}}} & (18)\end{matrix}$

where α_(t)≥0 satisfies, for any (s, a), Σ_(t=0)^(∞) α_(t)(s, a)=∞ and Σ_(t=0)^(∞) α_(t)²(s, a)<∞, then Q_(t)→Q*, where Q* is the solution of the Bellman equation

$\begin{matrix}{{\left( {H^{A}Q^{*}} \right)\left( {s,a} \right)} = {{{\alpha \cdot {{\mathbb{E}}_{s,a}\left\lbrack {\overset{\sim}{u}\left( {r_{t} + {{\gamma \cdot \max\limits_{a}}{Q^{*}\left( {s_{t + 1},a} \right)}} - {Q^{*}\left( {s,a} \right)}} \right)} \right\rbrack}} + {Q^{*}\left( {s,a} \right)}} = {{Q^{*}\left( {s,a} \right)}{\forall\left( {s,a} \right)}}}} & (19)\end{matrix}$

If Lemma A.2 is true, then it is shown in [28] that the corresponding policy optimizes the objective function Eq. (7).

Before proving convergence in Lemma A.2, a more general update rule is discussed:

q _(t+1)(i)=(1−α_(t)(i))q _(t)(i)+α_(t)(i)[(Hq _(t))(i)+w _(t)(i)]  (20)

where i is the independent variable (e.g., in single-agent Q learning, it is the state-action pair (s, a)), q_(t) ∈ ℝ^(d), H: ℝ^(d)→ℝ^(d) is an operator, w_(t) denotes some random noise term, and α_(t) is the learning rate with the understanding that α_(t)(i)=0 if q(i) is not updated at time t. Denote by ℱ_(t) the history of the algorithm up to time t,

ℱ_(t) = {q₀(i), . . . , q_(t)(i), w₀(i), . . . , w_(t)(i), α₀(i), . . . , α_(t)(i)}  (21)

Recall the following essential proposition:

Proposition 1 [6] Let q_(t) be the sequence generated by the iterationEq. (20), if the following assumption is true:

(a). The learning rates α_(t)(i) satisfy:

α_(t)(i)≥0; Σ_(t=0) ^(∞)α_(t)(i)=∞; Σ_(t=0) ^(∞)α_(t) ²(i)<∞; ∀i  (22)

(b). The noise terms w_(t)(i) satisfy:

(i) 𝔼[w_(t)(i)|ℱ_(t)]=0 for all i and t;

(ii) there exist constants A and B such that 𝔼[w_(t)²(i)|ℱ_(t)] ≤ A+B∥q_(t)∥² for some norm ∥⋅∥ on ℝ^(d).

(c). The mapping H is a contraction under sup-norm.

Then q_(t) converges to the unique solution q* of the equation Hq*=q*with probability 1.

In order to apply Proposition 1, the update rule Eq. (9) is formulatedby letting

$\begin{matrix}{{{q_{t + 1}\left( {s,a} \right)} = {{\left( {1 - \frac{\alpha_{t}\left( {s,a} \right)}{\alpha}} \right){q_{t}\left( {s,a} \right)}} + {{\frac{\alpha_{t}\left( {s,a} \right)}{\alpha}\left\lbrack {{\alpha \cdot {u\left( d_{t} \right)}} - {\alpha \cdot x_{0}} + {q_{t}\left( {s,a} \right)}} \right\rbrack}{where}}}}{{{\overset{\sim}{u}(x)}:={{u(x)} - x_{0}}};{{d_{t}:} = {r_{t} + {{\gamma \cdot \max\limits_{a}}{q_{t}\left( {s_{t + 1},a} \right)}} - {{q_{t}\left( {s,a} \right)}.}}}}} & (23)\end{matrix}$

And the following is set:

$\begin{matrix}{{\left( {Hq_{t}} \right)\left( {s,a} \right)} = {{\alpha \cdot {{\mathbb{E}}_{s,a}\left\lbrack {\overset{\sim}{u}\left( {r_{t} + {{\gamma \cdot \max\limits_{a}}{q_{t}\left( {s_{t + 1},a} \right)}} - {q_{t}\left( {s,a} \right)}} \right)} \right\rbrack}} + {q_{t}\left( {s,a} \right)}}} & (24)\end{matrix}$ $\begin{matrix}{{w_{t}\left( {s,a} \right)} = {{\alpha \cdot {\overset{˜}{u}\left( d_{t} \right)}} - {\alpha \cdot {{\mathbb{E}}_{s,a}\left\lbrack {\overset{˜}{u}\left( {r_{t} + {{\gamma \cdot \max\limits_{a}}{q_{t}\left( {s^{\prime}\ ,a} \right)}} - {q_{t}\left( {s,a} \right)}} \right)} \right\rbrack}}}} & (25)\end{matrix}$

where s′ is sampled from 𝒯[·|s,a].

More explicitly, Hq is defined as

$\begin{matrix}{{\left( {Hq} \right)\left( {s,a} \right)} = {{\alpha \cdot {\sum\limits_{s^{\prime}}{{\mathcal{T}\left\lbrack {\left. s^{\prime} \middle| s \right.,a} \right\rbrack} \cdot {\overset{\sim}{u}\left( {{r\left( {s,a} \right)} + {{\gamma \cdot \max\limits_{a^{\prime}}}\ {q\left( {s^{\prime}\ ,a^{\prime}} \right)}} - {q\left( {s,a} \right)}} \right)}}}} + {q\left( {s,a} \right)}}} & (26)\end{matrix}$

Next, it is shown that H is a contraction under sup-norm.

The utility function is assumed to satisfy:

Assumption A.1

i. The utility function u is strictly increasing and there exists some y₀ ∈ ℝ such that u(y₀)=x₀.

ii. There exist positive constants ϵ, L such that

${0 <} \in {\leq \frac{{u(x)} - {u(y)}}{x - y} \leq L}$

for all x≠y ∈ ℝ.

Assumption A.1 appears to exclude several important types of utility functions, such as the exponential function u(x)=exp(c·x), since it does not satisfy a global Lipschitz condition. But this can be addressed by a truncation when x is very large and by an approximation when x is very close to 0.

In addition, the immediate reward r_(t) is assumed to always satisfy:

Assumption A.2 r_(t) is uniformly sub-Gaussian over t with varianceproxy σ², i.e.,

$\begin{matrix}{{{\mathbb{E}}\left\lbrack r_{t} \right\rbrack} = 0} & (27)\end{matrix}$ $\begin{matrix}\begin{matrix}{{{\mathbb{E}}\left\lbrack {\exp\left( {c \cdot r_{t}} \right)} \right\rbrack} \leq {\exp\left( \frac{\sigma^{2}c^{2}}{2} \right)}} & {\forall{c \in {\mathbb{R}}}}\end{matrix} & (28)\end{matrix}$

Proposition 2 Suppose that Assumption A.1 and Assumption A.2 hold and 0<α<min(L⁻¹, 1). Then there exists a real number ᾱ ∈ [0,1) such that for all q, q′ ∈ ℝ^(d), ∥Hq−Hq′∥_(∞) ≤ ᾱ∥q−q′∥_(∞).

Proof. Define v(s) := max_(a) q(s,a) and v′(s) := max_(a) q′(s,a). Thus,

$\begin{matrix}{{\left| {{v(s)} - {v^{\prime}(s)}} \right|} \leq {\max\limits_{s,a}{\left| {{q\left( {s,a} \right)} - {q^{\prime}\left( {s,a} \right)}} \right|}} = {\left\| {q - q^{\prime}} \right\|_{\infty}}} & (29)\end{matrix}$

By Assumption A.1 and the monotonicity of ũ, there exists a ξ_((x,y)) ∈ [ϵ, L] such that ũ(x)−ũ(y)=ξ_((x,y))·(x−y). Then the following can be obtained:

$\begin{matrix}{{\left( {Hq} \right)\left( {s,a} \right)} - {\left( {Hq^{\prime}} \right)\left( {s,a} \right)}} & (30)\end{matrix}$ $\begin{matrix}{= {\sum\limits_{s^{\prime}}{{\mathcal{T}\left\lbrack {\left. s^{\prime} \middle| s \right.,a} \right\rbrack} \cdot \left\{ {{{\alpha\xi}_{({s,a,s^{\prime},q,q^{\prime}})} \cdot \left\lbrack {{\gamma{v\left( s^{\prime} \right)}} - {\gamma{v^{\prime}\left( s^{\prime} \right)}} - {q\left( {s,a} \right)} + {q^{\prime}\left( {s,a} \right)}} \right\rbrack} + \left( {{q\left( {s,a} \right)} - {q^{\prime}\left( {s,a} \right)}} \right)} \right\}}}} & (31)\end{matrix}$ $\begin{matrix}{\leq {\left( {1 - {{\alpha\left( {1 - \gamma} \right)}{\sum\limits_{s^{\prime}}{{\mathcal{T}\left\lbrack {\left. s^{\prime} \middle| s \right.,a} \right\rbrack} \cdot \xi_{({s,a,s^{\prime},q,q^{\prime}})}}}}} \right)\left\| {q - q^{\prime}} \right\|_{\infty}}} & (32)\end{matrix}$ $\begin{matrix}{\leq {\left( {1 - {{\alpha\left( {1 - \gamma} \right)}\epsilon}} \right)\left\| {q - q^{\prime}} \right\|_{\infty}}} & (33)\end{matrix}$

Hence, ᾱ=1−α(1−γ)ϵ is the required constant.

Now that it is shown that the requirements (a) and (c) of Proposition 1 hold, it remains to check (b). By Eq. (25), 𝔼[w_(t)(s, a)|ℱ_(t)]=0. Next, the proof of (b)(ii) is presented.

$\begin{matrix}{{{\mathbb{E}}\left\lbrack {w_{t}^{2}\left( {s,a} \right)} \middle| \mathcal{F}_{t} \right\rbrack} = {{\alpha^{2}{{\mathbb{E}}\left\lbrack \left( {ũ\left( d_{t} \right)} \right)^{2} \middle| \mathcal{F}_{t} \right\rbrack}} - {\alpha^{2}\left( {{\mathbb{E}}\left\lbrack {\overset{˜}{u}\left( d_{t} \right)} \middle| \mathcal{F}_{t} \right\rbrack} \right)}^{2}}} & (34)\end{matrix}$ $\begin{matrix}{\leq {\alpha^{2}{{\mathbb{E}}\left\lbrack \left( {\overset{˜}{u}\left( d_{t} \right)} \right)^{2} \middle| \mathcal{F}_{t} \right\rbrack}}} & (35)\end{matrix}$

By Assumption A.2,

${{{\mathbb{E}}{❘r_{t}❘}} < {\left( {2\sigma} \right)^{\frac{1}{2}}{\Gamma\left( \frac{1}{2} \right)}}},$

where Γ(⋅) is the Gamma function (see [10] for details). The upper bound for 𝔼[|r_(t)|] is denoted as R₁. Then 𝔼[|d_(t)|] ≤ R₁+2∥q_(t)∥_(∞); due to Assumption A.1, this implies that

𝔼[|ũ(d_(t))−ũ(0)|] ≤ 𝔼[L·|d_(t)|] ≤ L(R₁+2∥q_(t)∥_(∞))  (36)

Hence, by the triangle inequality,

𝔼[|ũ(d_(t))|] ≤ |ũ(0)|+LR₁+2L∥q_(t)∥_(∞)  (37)

And since

(a+b)² ≤ 2a²+2b² ∀a,b ∈ ℝ  (38)

it can be shown that

(|ũ(0)|+LR₁+2L∥q_(t)∥_(∞))² ≤ 2(|ũ(0)|+LR₁)²+8L²∥q_(t)∥_(∞)²  (39)

And since

$\begin{matrix}{{{\mathbb{E}}\left\lbrack {\left( {{\overset{˜}{u}\left( d_{t} \right)} - {\overset{˜}{u}(0)}} \right)^{2}} \middle| \mathcal{F}_{t} \right\rbrack} \leq {{\mathbb{E}}\left\lbrack {L \cdot d_{t}^{2}} \right\rbrack}} & (40)\end{matrix}$ $\begin{matrix}{= {{\mathbb{E}}\left\lbrack {L \cdot \left( {r_{t} + {{\gamma \cdot \max\limits_{a}}{q_{t}\left( {s^{\prime},a} \right)}} - {q_{t}\left( {s,a} \right)}} \right)^{2}} \right\rbrack}} & (41)\end{matrix}$ $\begin{matrix}{= {{\mathbb{E}}\left\lbrack {L \cdot \left( {r_{t}^{2} + {2r_{t} \cdot \left( {{{\gamma \cdot \max\limits_{a}}{q_{t}\left( {s^{\prime},a} \right)}} - {q_{t}\left( {s,a} \right)}} \right)} + \left( {{{\gamma \cdot \max\limits_{a}}{q_{t}\left( {s^{\prime},a} \right)}} - {q_{t}\left( {s,a} \right)}} \right)^{2}} \right)} \right\rbrack}} & (42)\end{matrix}$ $\begin{matrix}{\leq {{LR_{2}} + {2LR_{1}\left( {1 - \gamma} \right) \cdot \left\| q_{t} \right\|_{\infty}} + {{L\left( {1 - \gamma} \right)}^{2} \cdot \left\| q_{t} \right\|_{\infty}^{2}}}} & (43)\end{matrix}$

where R₂ is the upper bound for 𝔼[r_(t)²] due to Assumption A.2 (𝔼[r_(t)²] ≤ 4σ²·Γ(1) [10]).

Note that here ũ(0)=0, therefore:

α²·𝔼[(ũ(d_(t)))²|ℱ_(t)] ≤ α²·(LR₂+2LR₁(1−γ)·∥q_(t)∥_(∞)+L(1−γ)²·∥q_(t)∥_(∞)²)  (44)

hence,

𝔼[w_(t)²(s,a)|ℱ_(t)] ≤ 2α²·(LR₂+2LR₁(1−γ)·∥q_(t)∥_(∞)+L(1−γ)²·∥q_(t)∥_(∞)²)  (45)

if ∥q_(t)∥_(∞)≤1, then

𝔼[w_(t)²(s,a)|ℱ_(t)] ≤ 2α²·(LR₂+2LR₁(1−γ)+L(1−γ)·∥q_(t)∥_(∞)²)  (46)

if ∥q_(t)∥_(∞)>1, then

𝔼[w_(t)²(s,a)|ℱ_(t)] ≤ 2α²·(LR₂+(2LR₁(1−γ)+L(1−γ)²)·∥q_(t)∥_(∞)²)  (47)

It has been shown that q_(t) satisfies all of the requirements in Proposition 1, so q_(t)→q* with probability 1.

Nash-Q Learning Algorithm

This sub-section describes the Nash-Q Learning Algorithm [16] and its convergence guarantees. Assumption B.3 below will also be used in the proof for RAM-Q. FIG. 9 shows an example Nash Q-learning algorithm (Algorithm 6) for Agent A: for ∀(s, a_(A), a_(B)), the training system initializes Q_(A)¹(s, a_(A), a_(B))=0; Q_(A)²(s, a_(A), a_(B))=0; N_(A)(s, a_(A), a_(B))=0.

From t=1 to T, for each value of t: at state s_(t), the training system computes π_(A)¹(s_(t)), which is a mixed-strategy Nash equilibrium solution of the bimatrix game (Q_(A)¹(s_(t)), Q_(A)²(s_(t))). The system can select an action a_(t)^(A) based on π_(A)¹(s_(t)) according to an ϵ-greedy strategy, then observe and compute r_(t)^(A), r_(t)^(B), a_(t)^(B) and s_(t+1). At state s_(t+1), the training system computes π_(A)¹(s_(t+1)), π_(A)²(s_(t+1)), which are mixed-strategy Nash equilibrium solutions of the bimatrix game (Q_(A)¹(s_(t+1)), Q_(A)²(s_(t+1))). The training system then updates N_(A)(s_(t),a_(t)^(A), a_(t)^(B))=N_(A)(s_(t),a_(t)^(A), a_(t)^(B))+1 and sets the learning rate

$\alpha_{t}^{A} = {\frac{1}{N_{A}\left( {s_{t},a_{t}^{A},a_{t}^{B}} \right)}.}$

The system can update Q_(A) ¹, Q_(A) ² such that

Q _(A) ¹(s _(t) ,a _(t) ^(A) ,a _(t) ^(B))=(1−α_(t) ^(A))·Q _(A) ¹(s_(t) ,a _(t) ^(A) ,a _(t) ^(B))+α_(t) ^(A)·[r _(t) ^(A)+γ·π_(A) ¹(s_(t+1))Q _(A) ¹(s _(t+1))π_(A) ²(s _(t+1))]

Q _(A) ²(s _(t) ,a _(t) ^(A) ,a _(t) ^(B))=(1−α_(t) ^(A))·Q _(A) ²(s_(t) ,a _(t) ^(A) ,a _(t) ^(B))+α_(t) ^(A)·[r _(t) ^(B)+γ·π_(A) ¹(s_(t+1))Q _(A) ²(s _(t+1))π_(A) ²(s _(t+1))]

Assumption B.3 [16] A Nash equilibrium (π¹(s), π²(s)) for any bimatrixgame (Q¹(s), Q²(s)) during the training process satisfies one of thefollowing properties:

1. The Nash equilibrium is global optimal.

π¹(s)Q ^(k)(s)π²(s)≥{circumflex over (π)}¹(s)Q ^(k)(s){circumflex over(π)}²(s) ∀{circumflex over (π)}¹(s),{circumflex over (π)}²(s), andk=1,2  (48)

2. If the Nash equilibrium is not a global optimal, then an agentreceives a higher payoff when the other agent deviates from the Nashequilibrium strategy.

π¹(s)Q ¹(s)π²(s)≤π¹(s)Q ¹(s){circumflex over (π)}²(s) ∀{circumflex over(π)}²(s)  (49)

π¹(s)Q ^(k)(s)π²(s)≥{circumflex over (π)}¹(s)Q ^(k)(s){circumflex over(π)}²(s) ∀{circumflex over (π)}¹(s)  (50)

Theorem 5 (Theorem 4, [16]) Under Assumption B.3, the coupled sequences Q_(A)¹, Q_(A)², updated by Algorithm 6, converge to the Nash equilibrium Q values (Q_(*)¹, Q_(*)²), with Q_(*)^(k) (k=1,2) defined as

Q_(*)¹(s,a^(A),a^(B)) = r^(A)(s,a^(A),a^(B)) + γ·𝔼[J^(A)(s′,π_(*)^(A),π_(*)^(B))]  (51)

Q_(*)²(s,a^(A),a^(B)) = r^(B)(s,a^(A),a^(B)) + γ·𝔼[J^(B)(s′,π_(*)^(A),π_(*)^(B))]  (52)

where (π_(*)^(A),π_(*)^(B)) is a Nash equilibrium solution for this stochastic game (J^(A),J^(B)) and

J^(A)(s′,π_(*)^(A),π_(*)^(B)) = Σ_(t=0)^(∞) γ^(t) 𝔼[r_(t)^(A)|π_(*)^(A),π_(*)^(B), s₀=s′]  (53)

J^(B)(s′,π_(*)^(A),π_(*)^(B)) = Σ_(t=0)^(∞) γ^(t) 𝔼[r_(t)^(B)|π_(*)^(A),π_(*)^(B), s₀=s′]  (54)

Proof of Theorem 2

Poisson masks M ˜ Poisson(1) provide parallel learning since

$\left. {{Binomial}\left( {T,\frac{1}{T}} \right)}\rightarrow{{Poisson}(1)} \right.$

as T→∞, so each Q table Q^(i) is trained in parallel. The proof ofconvergence of Q^(i) for all i ∈ {1, . . . , k} is shown in the Proof ofTheorem 1 above. Hence

$\left. {\frac{1}{k}{\sum\limits_{i = 1}^{k}Q^{i}}}\rightarrow{Q^{*}{w.p}{\text{.1}.}} \right.$
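
As an aside, the Binomial(T, 1/T) → Poisson(1) limit invoked here can be checked numerically with a quick sketch (illustrative only):

    import numpy as np

    rng = np.random.default_rng(0)
    T, n = 10_000, 200_000
    binom = rng.binomial(T, 1.0 / T, size=n)   # Binomial(T, 1/T) draws
    pois = rng.poisson(1.0, size=n)            # Poisson(1) draws
    for v in (0, 1, 2, 3):
        # Empirical probabilities of v successes should nearly coincide.
        print(v, round((binom == v).mean(), 4), round((pois == v).mean(), 4))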

Proof of Convergence of RAM-Q

In this section, the convergence of Algorithm 3 (RAM-Q) is proven underAssumption B.3. The convergence proof is based on the following lemma.

Lemma D.3 (Conditional Averaging Lemma [30]) Assume the learning rate α_(t) satisfies Proposition 1(a). Then, the process Q_(t+1)(i)=(1−α_(t)(i))Q_(t)(i)+α_(t)w_(t)(i) converges to 𝔼[w_(t)(i)|h_(t),α_(t)], where h_(t) is the history at time t.

The proof of convergence of Q^(P) is shown as an example, and the proof of convergence of Q^(A) is the same. First, the update rule Eq. (11) is reformulated as:

$\begin{matrix}{{Q^{P}\left( {s_{t},a_{t}^{P},a_{t}^{A}} \right)} = {{\left( {1 - \frac{\alpha_{t}}{\alpha}} \right) \cdot {Q^{P}\left( {s_{t},a_{t}^{P},a_{t}^{A}} \right)}} + {\frac{\alpha_{t}}{\alpha}.\ \text{ }\left\lbrack {{\alpha \cdot {u^{P}\left( {r_{t}^{P} + {{\gamma \cdot {\pi^{P}\left( s_{t + 1} \right)}}{Q^{P}\left( s_{t + 1} \right)}{\pi^{A}\left( s_{t + 1} \right)}} - {Q^{P}\left( {s_{t},a_{t}^{P},a_{t}^{A}} \right)}} \right)}} - {\alpha \cdot x_{0}} +} \right.}}} & (55)\end{matrix}$ $\begin{matrix}\left. {Q^{P}\left( {s_{t},a_{t}^{P},a_{t}^{A}} \right)} \right\rbrack & (56)\end{matrix}$

set

(H ^(P) Q ^(P))(s _(t) ,a _(t) ^(P) ,a _(t) ^(A))=α·u ^(P)(r _(t)^(P)+γ·π^(P)(s _(t+1))Q ^(P)(s _(t+1))π^(A)(s _(t+1))−Q ^(P)(s _(t) ,a_(t) ^(P) ,a _(t) ^(A)))−α·x ₀ +Q ^(P)(s _(t) ,a _(t) ^(P) ,a _(t)^(A))  (57)

and H^(A)Q^(A) is defined symmetrically as

(H ^(A) Q ^(A))(s _(t) ,a _(t) ^(P) ,a _(t) ^(A))=α·u ^(A)(r _(t)^(A)+γ·π^(P)(s _(t+1))Q ^(A)(s _(t+1))π^(A)(s _(t+1))−Q ^(A)(s _(t) ,a_(t) ^(P) ,a _(t) ^(A)))−α·x ₁ +Q ^(A)(s _(t) ,a _(t) ^(P) ,a _(t)^(A))  (58)

It is shown in [16] that the operator (M_(t)^(P), M_(t)^(A)) is a γ-contraction mapping, where (M_(t)^(P), M_(t)^(A)) is defined as

M _(t) ^(P) Q ^(P)(s)=r _(t) ^(P)+γ·π^(P)(s)Q ^(P)(s)π^(A)(s)  (59)

M _(t) ^(A) Q ^(A)(s)=r _(t) ^(A)+γ·π^(P)(s)Q ^(A)(s)π^(A)(s)  (60)

Next, it is shown that (H^(P), H^(A)) is a contraction under sup-norm(under Assumption A.1).

$\begin{matrix}{{{H^{P}Q^{P}} - {H^{P}{\overset{\hat{}}{Q}}^{P}}} = {{\alpha \cdot \left\lbrack {\xi_{Q^{P},{\hat{Q}}^{P}}^{P} \cdot \left( {{M^{P}Q^{P}} - {M^{P}{\overset{\hat{}}{Q}}^{P}} - \left( {Q^{P} - {\overset{\hat{}}{Q}}^{P}} \right)} \right)} \right\rbrack} + \left( {Q^{P} - {\overset{\hat{}}{Q}}^{P}} \right)}} & (61)\end{matrix}$ $\begin{matrix}{\leq {{\alpha \cdot \left\lbrack {{\xi_{Q^{P},{\hat{Q}}^{P}}^{P} \cdot \left( {\gamma - 1} \right)}{{Q^{P} - {\overset{\hat{}}{Q}}^{P}}}_{\infty}} \right\rbrack} + {{Q^{P} - {\overset{\hat{}}{Q}}^{P}}}_{\infty}}} & (62)\end{matrix}$ $\begin{matrix}{\leq {\left( {{1 - \alpha} \in \left( {1 - \gamma} \right)} \right) \cdot {{Q^{P} - {\overset{\hat{}}{Q}}^{P}}}_{\infty}}} & (63)\end{matrix}$

Similarly, H^(A)Q^(A)−H^(A){circumflex over(Q)}^(A)≤(1−αϵ(1−γ))·∥Q^(A)−{circumflex over (Q)}^(A)∥_(∞).

Hence (H^(P), H^(A)) is a (1−αϵ(1−γ))-contraction under the sup-norm. By Lemma D.3, the update rules Eqs. (11) and (12) converge respectively to

Q^(P)(s_(t),a_(t)^(P),a_(t)^(A)) → 𝔼[α·u^(P)(r_(t)^(P)+γ·π^(P)(s_(t+1))Q^(P)(s_(t+1))π^(A)(s_(t+1))−Q^(P)(s_(t),a_(t)^(P),a_(t)^(A)))−α·x₀+Q^(P)(s_(t),a_(t)^(P),a_(t)^(A))]  (64)

Q^(A)(s_(t),a_(t)^(P),a_(t)^(A)) → 𝔼[α·u^(A)(r_(t)^(A)+γ·π^(P)(s_(t+1))Q^(A)(s_(t+1))π^(A)(s_(t+1))−Q^(A)(s_(t),a_(t)^(P),a_(t)^(A)))−α·x₁+Q^(A)(s_(t),a_(t)^(P),a_(t)^(A))]  (65)

i.e., Eqs. (11) and (12) converge respectively to Q_(P)*, Q_(A)*, where Q_(P)*, Q_(A)* are the solutions to the Bellman equations

𝔼_(s,a_(P),a_(A))[u^(P)(r^(P)(s,a^(P),a^(A))+γ·π^(P)*(s′)Q_(P)*(s′)π^(A)*(s′)−Q_(P)*(s,a^(P),a^(A)))]=x₀  (66)

𝔼_(s,a_(P),a_(A))[u^(A)(r^(A)(s,a^(P),a^(A))+γ·π^(P)*(s′)Q_(A)*(s′)π^(A)*(s′)−Q_(A)*(s,a^(P),a^(A)))]=x₁  (67)

where (π^(P)*,π^(A)*) is the Nash equilibrium solution to the bimatrixgame (Q*_(P), Q*_(A)).

Next it is shown that (π^(P)*, π^(A)*) is a Nash equilibrium solution for the game with equilibrium payoffs ({tilde over (J)}^(P)(s,π^(P)*,π^(A)*), {tilde over (J)}^(A)(s,π^(P)*,π^(A)*)). As in [28], for any X ∈ ℝ, define a mapping 𝒰^(P)(X|s, a^(P), a^(A)): ℝ×𝒮×𝒜_(P)×𝒜_(A)→ℝ (for brevity, written as 𝒰_(s,a_(P),a_(A))^(P)), defined by

𝒰_(s,a_(P),a_(A))^(P)(X) = sup{m ∈ ℝ | 𝔼_(s,a_(P),a_(A))[u^(P)(X−m)] ≥ x₀}  (68)

Similar to [28, 32], suppose (π^(P), π^(A)) is a Nash equilibrium solution to the game ({tilde over (J)}^(P)(s, π^(P), π^(A)), {tilde over (J)}^(A)(s, π^(P), π^(A))); then the payoffs {tilde over (J)}^(P)(s, π^(P), π^(A)), {tilde over (J)}^(A)(s, π^(P), π^(A)) are the solution to the risk-sensitive Bellman equations

{tilde over (J)}^(P)(s,π^(P),π^(A)) = π^(P)(s) 𝒰_(s,a_(P),a_(A))^(P)(r^(P)(s,:,:)+γ·{tilde over (J)}^(P)(s′,π^(P),π^(A))) π^(A)(s) ∀s ∈ 𝒮  (69)

{tilde over (J)}^(A)(s,π^(P),π^(A)) = π^(P)(s) 𝒰_(s,a_(P),a_(A))^(A)(r^(A)(s,:,:)+γ·{tilde over (J)}^(A)(s′,π^(P),π^(A))) π^(A)(s) ∀s ∈ 𝒮  (70)

And the corresponding Q tables satisfy

Q_(P)(s,a^(P),a^(A)) = 𝒰_(s,a_(P),a_(A))^(P)(r^(P)(s,a^(P),a^(A))+γ{tilde over (J)}^(P)(s′,π^(P),π^(A)))  (71)

Q_(A)(s,a^(P),a^(A)) = 𝒰_(s,a_(P),a_(A))^(A)(r^(A)(s,a^(P),a^(A))+γ{tilde over (J)}^(A)(s′,π^(P),π^(A)))  (72)

Note that 𝒰_(s,a_(P),a_(A))^(P) is a monotonic one-to-one mapping, so as shown in Theorem 4.6.5 in [13], (π^(P), π^(A)) is the Nash equilibrium solution to the bimatrix game (Q_(P), Q_(A)). Then if Q_(P)=Q_(P)* and Q_(A)=Q_(A)* (i.e., Q_(P) and Q_(A) are the solution to Eqs. (66) and (67)), the Nash solution of the bimatrix game (Q_(P)*, Q_(A)*) returned by Algorithm 3 (RAM-Q) will be the Nash solution for the game ({tilde over (J)}^(P), {tilde over (J)}^(A)).

[28] showed that Eqs. (71) and (72) are equivalent to

𝔼_(s,a_(P),a_(A))[u^(P)(r^(P)(s,a^(P),a^(A))+γ{tilde over (J)}^(P)(s′,π^(P),π^(A))−Q_(P)(s,a^(P),a^(A)))]=x₀  (73)

𝔼_(s,a_(P),a_(A))[u^(A)(r^(A)(s,a^(P),a^(A))+γ{tilde over (J)}^(A)(s′,π^(P),π^(A))−Q_(A)(s,a^(P),a^(A)))]=x₁  (74)

Plugging in Eq. (69) and Eq. (70),

𝔼_(s,a_(P),a_(A))[u^(P)(r^(P)(s,a^(P),a^(A))+γ·π^(P)Q_(P)(s′)π^(A)−Q_(P)(s,a^(P),a^(A)))]=x₀  (75)

𝔼_(s,a_(P),a_(A))[u^(A)(r^(A)(s,a^(P),a^(A))+γ·π^(P)Q_(A)(s′)π^(A)−Q_(A)(s,a^(P),a^(A)))]=x₁  (76)

which is exactly Eqs. (66) and (67).

It has been shown that under Assumption B.3, Eq. (69) and Eq. (66) are equivalent. Hence Algorithm 3 (RAM-Q) converges to (Q_(P)*, Q_(A)*) s.t. the Nash equilibrium solution (π^(P)*, π^(A)*) for the bimatrix game (Q_(P)*, Q_(A)*) is the Nash equilibrium solution to the game, and the equilibrium payoffs are {tilde over (J)}^(P)(s, π^(P)*, π^(A)*); {tilde over (J)}^(A)(s, π^(P)*, π^(A)*).

RA3-Q

Previously, a short version of RA3-Q was presented as Algorithm 4 (e.g., FIG. 6); a detailed version is presented in Algorithm 7. FIGS. 10A and 10B show an example risk-averse adversarial Q-learning algorithm (RA3-Q, Algorithm 7). The input data may include: training steps T; exploration rate ϵ; number of models k; risk control parameters λ_(P), λ_(A); utility function parameters β^(P)<0; β^(A)>0.

The training system first initializes Q_(P)^(i)(s, a_(P), a_(A))=0; Q_(A)^(i)(s, a_(P), a_(A))=0 for ∀i=1, . . . , k and (s, a_(P), a_(A)); and the visit-count table N=0. The training system then randomly samples the action-choosing head integers H_(P), H_(A) ∈ {1, . . . , k}.

From t=1 to T, for each value of t (“while in the t loop”): the trainingsystem sets Q_(P)=Q_(P) ^(H) ^(P) and computes {circumflex over (Q)}_(P)by:

$\begin{matrix}{{{\overset{\hat{}}{Q}}_{P}\left( {s,a_{P},a_{A}} \right)} = {{Q_{P}\left( {s,a_{P},a_{A}} \right)} - {\lambda_{P} \cdot \frac{\sum\limits_{i = 1}^{k}\left( {{Q_{P}^{i}\left( {s,a_{P},a_{A}} \right)} - {{\overset{\_}{Q}}_{P}\left( {s,a_{P},a_{A}} \right)}} \right)^{2}}{k - 1}}},\quad{\lambda_{P} > 0},\ {where}\ {{\overset{\_}{Q}}_{P}\left( {s,a_{P},a_{A}} \right)} = {\frac{1}{k}{\sum\limits_{i = 1}^{k}{Q_{P}^{i}\left( {s,a_{P},a_{A}} \right)}}};} & (77)\end{matrix}$

the training system sets Q_(A)=Q_(A) ^(H) ^(A) and computes {circumflexover (Q)}_(A) by

$\begin{matrix}{{{\overset{\hat{}}{Q}}_{A}\left( {s,a_{P},a_{A}} \right)} = {{Q_{A}\left( {s,a_{P},a_{A}} \right)} + {\lambda_{A} \cdot \frac{\sum\limits_{i = 1}^{k}\left( {{Q_{A}^{i}\left( {s,a_{P},a_{A}} \right)} - {{\overset{\_}{Q}}_{A}\left( {s,a_{P},a_{A}} \right)}} \right)^{2}}{k - 1}}},\quad{\lambda_{A} > 0},\ {where}\ {{\overset{\_}{Q}}_{A}\left( {s,a_{P},a_{A}} \right)} = {\frac{1}{k}{\sum\limits_{i = 1}^{k}{Q_{A}^{i}\left( {s,a_{P},a_{A}} \right)}}}.} & (78)\end{matrix}$

The optimal actions (a′_(P), a′_(A)) are defined as

$\begin{matrix}{{{\overset{\hat{}}{Q}}_{P}\left( {s_{t},a_{p}^{\prime},a_{A}^{0}} \right)} = {\max\limits_{a_{P},,a_{A}}{{\overset{\hat{}}{Q}}_{P}\left( {s_{t},a_{P},a_{A}} \right)}{for}{some}a_{A}^{0}}} & (79)\end{matrix}$ $\begin{matrix}{{{\overset{\hat{}}{Q}}_{A}\left( {s_{t},a_{p}^{0},a_{A}^{\prime}} \right)} = {\max\limits_{a_{P},a_{A}}{{\overset{\hat{}}{Q}}_{A}\left( {s_{t},a_{P},a_{A}} \right)}{for}{some}a_{P}^{0}}} & (80)\end{matrix}$

While in the t loop, the training system selects actions a_(P), a_(A)according to {circumflex over (Q)}_(P), {circumflex over (Q)}_(A) byapplying ϵ-greedy strategy. Two agents respectively execute actionsa_(P), a_(A) and observe (s_(t),a_(P), a_(A), r_(t) ^(A), r_(t) ^(P),s_(t+1))

While in the t loop, the training system generates a mask M ∈ ℕ^(k) with entries drawn from Poisson(1) and updates

${\alpha\left( {s_{t},{a_{P\prime}a_{A}}} \right)} = \frac{1}{N\left( {{s_{t}a_{P}},a_{A}} \right)}$

and N(s_(t),a_(P),a_(A))=N(s_(t),a_(P), a_(A))+1.

While in the t loop, from i=1, . . . , k, for each value of i (“while inthe i loop”): if and when M_(i)=1, the training system updates Q_(P)^(i) by

$\begin{matrix}{{Q_{P}^{i}\left( {s_{t},a_{P},a_{A}} \right)} = {{Q_{P}^{i}\left( {s_{t},{a_{P\prime}a_{A}}} \right)} + {{\alpha\left( {s_{t},a_{P},a_{A}} \right)} \cdot \left\lbrack {{u^{P}\left( {r_{t}^{P} + {{\gamma \cdot \max\limits_{a_{P},a_{A}}}{Q_{P}^{i}\left( {s_{t + 1},a_{P},a_{A}} \right)}} - {Q_{P}^{i}\left( {s_{t},a_{P},a_{A}} \right)}} \right)} - x_{0}} \right\rbrack}}} & (81)\end{matrix}$

where u^(P) is a monotonically increasing concave utility function, e.g., u^(P)(x)=−e^(β^(P)·x) where β^(P)<0; and x₀=−1.

While in the t loop, from i=1, . . . , k, for each value of i ("while in the i loop"): if and when M_(i)=1, the training system updates Q_(A)^(i) by:

$\begin{matrix}{{Q_{A}^{i}\left( {s_{t},a_{P},a_{A}} \right)} = {{Q_{A}^{i}\left( {s_{t},{a_{P\prime}a_{A}}} \right)} + {{\alpha\left( {s_{t},a_{P},a_{A}} \right)} \cdot \left\lbrack {{u^{A}\left( {r_{t}^{A} + {{\gamma \cdot \max\limits_{a_{P},a_{A}}}{Q_{A}^{i}\left( {s_{t + 1},a_{P},a_{A}} \right)}} - {Q_{A}^{i}\left( {s_{t},a_{P},a_{A}} \right)}} \right)} - x_{1}} \right\rbrack}}} & (82)\end{matrix}$

where u^(A) is a monotonically increasing convex utility function, e.g., u^(A)(x)=e^(β^(A)·x) where β^(A)>0; x₁=1.

Once outside of the i loop, the training system updates H_(P) and H_(A)by randomly sampling integers from 1 to k.

Once outside of the t loop, the training system returns

${\frac{1}{k}{\sum\limits_{i = 1}^{k}Q_{P}^{i}}};{\frac{1}{k}{\sum\limits_{i = 1}^{k}{Q_{A}^{i}.}}}$

In this section, convergence issues of RA3-Q are discussed. First, a simplified setting is shown where, if the adversary's policy is a fixed policy π₀^(A), the update rule for the protagonist agent Eq. (81) converges to the optimum of J^(P)(s, :, π₀^(A)). Similarly, if the protagonist's policy is a fixed policy π₀^(P), the update rule for the adversary agent Eq. (82) converges to the optimum of J^(A)(s, π₀^(P), :).

Poisson masks M ˜ Poisson(1) provide parallel learning since

$\left. {{Binomial}\left( {T,\ \frac{1}{T}} \right)}\rightarrow{{Poisson}(1)} \right.$

as T→∞, so each Q table of protagonist/adversary, Q_(P) ^(i), Q_(A)^(i), are trained in parallel respectively.

First, the proof for the convergence of the iterative procedure is shown. The protagonist agent is used as an example, and the proof for the adversary is similar.

Fix the policy for the adversary; then, according to Proposition 3.1 in [28], for any random variable X, the following statements are equivalent:

$\begin{matrix}{{\frac{1}{\beta^{P}}\log{{\mathbb{E}}_{\mu}\left\lbrack {\exp\left( {\beta^{P} \cdot X} \right)} \right\rbrack}} = m^{*}} & (i)\end{matrix}$ $\begin{matrix}{{{\mathbb{E}}_{\mu}\left\lbrack {u^{P}\left( {X - m^{*}} \right)} \right\rbrack} = x_{0}} & ({ii})\end{matrix}$

The above proposition is used in the following context to show that theconvergent point is the optimal of the objective function {tilde over(J)}^(P)(s, :, π₀ ^(A)).

Compared to RAQL (Algorithm 5), RA3-Q uses a multi-agent extension of the MDP, where the transition function is 𝒫: 𝒮×𝒜_(P)×𝒜_(A)→Δ(𝒮). The update rule Eq. (81) can be reformulated by letting:

$\begin{matrix}{{q_{t + 1}^{P}\left( {s,a_{P},a_{A}} \right)} = {{\left( {1 - \frac{\alpha_{t}\left( {s,a_{P},a_{A}} \right)}{\alpha}} \right){q_{t}^{P}\left( {s,a_{P},a_{A}} \right)}} + {\frac{\alpha_{t}\left( {s,a_{P},a_{A}} \right)}{\alpha} \cdot \left\lbrack {{\alpha \cdot {u\left( d_{t} \right)}} - {\alpha \cdot x_{0}} + {q_{t}^{P}\left( {s,a_{P},a_{A}} \right)}} \right\rbrack}}} & (83)\end{matrix}$ $\begin{matrix}{{where}\ {d_{t}:} = {r_{t}^{P} + {{\gamma \cdot \max\limits_{a_{P},a_{A}}}{q_{t}^{P}\left( {s^{\prime},a_{P},a_{A}} \right)}} - {q_{t}^{P}\left( {s,a_{P},a_{A}} \right)}},\ {x_{0} = {- 1}},\ {\alpha \in \left( {0,{\min\left( {L^{- 1},1} \right)}} \right\rbrack},\ {and}\ {set}} & (84)\end{matrix}$ $\begin{matrix}{{\left( {H^{P}q_{t}^{P}} \right)\left( {s,a_{P},a_{A}} \right)} = {{\alpha \cdot {{\mathbb{E}}_{s,a_{P},a_{A}}\left\lbrack {\overset{\sim}{u}\left( {r_{t}^{P} + {{\gamma \cdot \max\limits_{a_{P},a_{A}}}{q_{t}^{P}\left( {s^{\prime},a_{P},a_{A}} \right)}} - {q_{t}^{P}\left( {s,a_{P},a_{A}} \right)}} \right)} \right\rbrack}} + {q_{t}^{P}\left( {s,a_{P},a_{A}} \right)}}} & (85)\end{matrix}$ $\begin{matrix}{{w_{t}\left( {s,a_{P},a_{A}} \right)} = {{\alpha \cdot {\overset{˜}{u}\left( d_{t} \right)}} - {\alpha \cdot {{\mathbb{E}}_{s,a_{P},a_{A}}\left\lbrack {\overset{\sim}{u}\left( {r_{t}^{P} + {{\gamma \cdot \max\limits_{a_{P},a_{A}}}{q_{t}^{P}\left( {s^{\prime},a_{P},a_{A}} \right)}} - {q_{t}^{P}\left( {s,a_{P},a_{A}} \right)}} \right)} \right\rbrack}}}} & (86)\end{matrix}$ $\begin{matrix}{{\overset{˜}{u}(x)} = {{u(x)} - x_{0}}} & (87)\end{matrix}$

Next, it is shown that H^(P) is a (1−α(1−γ)ϵ)-contraction under Assumption A.1: for any two q tables q, q′, define

${{v^{P}(s)}:} = {{\max\limits_{a_{P},a_{A}}{q\left( {s,a_{P},a_{A}} \right)}{and}{}{v^{P^{\prime}}(s)}}:={\max\limits_{a_{P,}a_{A}}{{q^{\prime}\left( {s,a_{P},a_{A}} \right)}.}}}$

Thus,

$\begin{matrix}{{{❘{{v^{P}(s)} - {v^{P^{\prime}}(s)}}❘} \leq {\max\limits_{s,a_{P},a_{A}}{❘{{q\left( {s,a_{P},a_{A}} \right)} - {q^{\prime}\left( {s,a_{P},a_{A}} \right)}}❘}}} = {{q - q^{\prime}}}_{\infty}} & (88)\end{matrix}$

By Assumption A.1 and the monotonicity of ũ, for given x, y ∈ ℝ, there exists ξ_((x,y)) ∈ [ϵ, L] such that

$\begin{matrix}{{{{{\overset{\sim}{u}(x)} - {\overset{\sim}{u}(y)}} = {\xi_{({x,y})} \cdot \left( {x - y} \right)}},{then}}{{\left( {H^{P}q} \right)\left( {s,a_{P},a_{A}} \right)} - {\left( {H^{P}q^{\prime}} \right)\left( {s,a_{P},a_{A}} \right)}}} & (89)\end{matrix}$ $\begin{matrix}{= {\sum\limits_{s^{\prime}}{{\mathcal{P}\left\lbrack {\left. s^{\prime} \middle| s \right.,a_{P},a_{A}} \right\rbrack} \cdot \left\{ {{{\alpha\xi}_{({s,a_{P},a_{A},s^{\prime},q,q^{\prime}})} \cdot \left\lbrack {{\gamma \cdot {v^{P}\left( s^{\prime} \right)}} - {\gamma \cdot {v^{P^{\prime}}\left( s^{\prime} \right)}} - {q\left( {s,a_{P},a_{A}} \right)} + {q^{\prime}\left( {s,a_{P},a_{A}} \right)}} \right\rbrack} + \left( {{q\left( {s,a_{P},a_{A}} \right)} - {q^{\prime}\left( {s,a_{P},a_{A}} \right)}} \right)} \right\}}}} & (90)\end{matrix}$ $\begin{matrix}{\leq {\left( {1 - {{\alpha\left( {1 - \gamma} \right)}{\sum\limits_{s^{\prime}}{{\mathcal{P}\left\lbrack {\left. s^{\prime} \middle| s \right.,a_{P},a_{A}} \right\rbrack} \cdot \xi_{({s,a_{P},a_{A},s^{\prime},q,q^{\prime}})}}}}} \right){{q - q^{\prime}}}_{\infty}}} & (91)\end{matrix}$ $\begin{matrix}{\leq {\left( {{1 - {\alpha\left( {1 - \gamma} \right)}} \in} \right){{q - q^{\prime}}}_{\infty}}} & (92)\end{matrix}$

Hence H^(P) is a contractor.

By Eq. (86), 𝔼[w_(t)(s, a_(P), a_(A))|ℱ_(t)]=0. Hence it remains to prove (b)(ii) in Proposition 1.

𝔼[w_(t)²(s,a_(P),a_(A))|ℱ_(t)] = α²·𝔼[(ũ(d_(t)))²|ℱ_(t)] − α²·(𝔼[ũ(d_(t))|ℱ_(t)])² ≤ α²·𝔼[(ũ(d_(t)))²|ℱ_(t)]  (93)

Following the same procedure as in the proof of Theorem 1, condition (b)(ii) of Proposition 1 also holds in this case. As the learning rate satisfies condition (a), by Proposition 1, q→q*, where q* is the solution to the Bellman equation

$\begin{matrix}\begin{matrix}{{{\mathbb{E}}_{s,a_{P},a_{A}}\left\lbrack {u^{P}\ \left( {r_{t}^{P} + {{\gamma \cdot \max\limits_{a_{P},a_{A}}}\ q\left( {s^{\prime},a_{P},a_{A}} \right)} - \text{ }{q\left( {s,a_{P},a_{A}} \right)}} \right)} \right\rbrack} = x_{0}} & {\pi_{0}^{A}{is}{fixed}}\end{matrix} & (94)\end{matrix}$

for ∀(s, a_(P), a_(A)), where s′ is sampled from 𝒫[⋅|s, a_(P), a_(A)].

Similarly, it can be shown that for a fixed policy for the protagonist agent, the update rule Eq. (82) will guarantee that q_(A)→q_(A)*, where q_(A)* is the solution to the Bellman equation

$\begin{matrix}\begin{matrix}{{{\mathbb{E}}_{s,a_{P},a_{A}}\left\lbrack {u^{A}\left( {r_{t}^{A} + {{\gamma \cdot \max\limits_{a_{P},a_{A}}}{q_{A}\left( {s^{\prime},a_{P},a_{A}} \right)}} - {q_{A}\left( {s,a_{P},a_{A}} \right)}} \right)} \right\rbrack} = x_{1}} & {\pi_{0}^{P}\ {is}\ {fixed}}\end{matrix} & (95)\end{matrix}$

for ∀(s, a_(P), a_(A)), where s′ is sampled from 𝒫[⋅|s, a_(P), a_(A)].

This does not imply a convergence guarantee for RA3-Q, because of the assumption that the protagonist's (or adversary's) policy is fixed. Only if one of the agents (e.g., the protagonist) stops learning (and its policy becomes fixed) at some point will the other agent (the adversary) also converge. Note that in the general multi-agent learning case this is a challenge, and it is often hard to strike a balance between theoretical algorithms (with convergence guarantees) and practical algorithms (losing guarantees but with good empirical results), as shown in the experimental results above.

Meta-Game Payoff Examples and EGT Plots

Table 4 shows a payoff table of the rock-paper-scissors game; its corresponding directional field 800 is shown in FIG. 11A, and its trajectory plot 810 is shown in FIG. 11B. It can be observed from FIGS. 11A and 11B that the equilibrium of Rock-Paper-Scissors is the centroid of the strategy simplex.

TABLE 4 Payoff Table of Rock-Paper-Scissors

N_(Rock)  N_(Paper)  N_(Scissors)  R_(Rock)  R_(Paper)  R_(Scissors)
2         0          0             0         0          0
1         1          0             −1        1          0
0         2          0             0         0          0
1         0          1             1         0          −1
0         0          2             0         0          0
0         1          1             0         −1         1

Another example of a two-player meta-game payoff table of three strategies is in Table 5, with its corresponding directional field 900 shown in FIG. 12A and its trajectory plot 910 in FIG. 12B, where the white circles denote unstable equilibria (saddle points) and black solid circles denote globally stable equilibria.

TABLE 5 An example of a meta-game payoff table of 2 players, 3 strategies.

N_(i1)  N_(i2)  N_(i3)  R_(i1)  R_(i2)  R_(i3)
2       0       0       0.5     0       0
1       1       0       0.3     0.7     0
0       2       0       0       0.9     0
1       0       1       0.35    0       0.45
0       0       2       0       0       0.6
0       1       1       0       0.66    0.38

Proof of Theorem 4

Theorem 6. For a Normal Form Game with p players, where each player i chooses a strategy π^(i) from a set of strategies S^(i)={π₁^(i), . . . , π_(k)^(i)} and receives a risk-averse payoff h^(i)(π¹, . . . , π^(p)): S¹× . . . ×S^(p)→ℝ satisfying Assumption G.4: if x is a Nash Equilibrium for the game ĥ^(i)(π¹, . . . , π^(p)), then it is a 2ϵ-Nash equilibrium for the game h^(i)(π¹, . . . , π^(p)) with probability 1−δ if the game is played n times, where

$\begin{matrix}{n \geq {\max\left\{ {{{- \frac{8R^{2}}{\in^{2}}}{\log\left\lbrack {\frac{1}{4}\left( {1 - \left( {1 - \delta} \right)^{\frac{1}{{{|s^{1}|{\times \ldots \times}|s^{p}}❘} \times p}}} \right)} \right\rbrack}},\text{ }\frac{64\beta^{2}{\omega^{2} \cdot {\Gamma(2)}}}{\in^{2}\left\lbrack {1 - \left( {1 - \delta} \right)^{\frac{1}{|s^{1}|{\times_{\ldots} \times}|s^{p}|{\times p}}}} \right\rbrack}} \right\}}} & (96)\end{matrix}$

Assumption G.4 The stochastic return h (for each player and each strategy) for each simulation has a sub-Gaussian tail, i.e., there exists ω>0 s.t.

$\begin{matrix}{{{\begin{matrix}{{{\mathbb{E}}\left\lbrack {\exp\left( {c \cdot \left( {h - {{\mathbb{E}}\lbrack h\rbrack}} \right)} \right)} \right\rbrack} \leq {\exp\left( \frac{\omega^{2}c^{2}}{2} \right)}} & {\ {\forall{c \in {\mathbb{R}}}}}\end{matrix}{and}R} > {0{s.t.h}}} \in {\left\lbrack {{- R},R} \right\rbrack.}} & (97)\end{matrix}$

Proof. Note that we have the following relation:

$\begin{matrix}{{{{\mathbb{E}}_{\pi \sim x}\left\lbrack {h^{i}(\pi)} \right\rbrack} = {{{\mathbb{E}}_{\pi \sim x}\left\lbrack {{\overset{\hat{}}{h}}^{i}(\pi)} \right\rbrack} + {{\mathbb{E}}_{\pi - x}\left\lbrack {{h^{i}(\pi)} - {{\overset{\hat{}}{h}}^{i}(\pi)}} \right\rbrack}}}{Then}} & (98)\end{matrix}$ $\begin{matrix}{{{\mathbb{E}}_{\pi^{- i} \sim x^{- i}}\left\lbrack {h^{i}\left( {\pi^{i},\ \pi^{- i}} \right)} \right\rbrack} = {{{{\mathbb{E}}_{\pi^{- i} \sim x^{- i}}\left\lbrack {{\overset{\hat{}}{h}}^{i}\left( {\pi^{i},\ \pi^{- i}} \right)} \right\rbrack} + {{{\mathbb{E}}_{\pi^{- i} \sim x^{- i}}\left\lbrack {{h^{i}\left( {\pi^{i},\ \pi^{- i}} \right)} - {{\overset{\hat{}}{h}}^{i}\left( {\pi^{i},\ \pi^{- i}} \right)}} \right\rbrack}\max\limits_{\pi^{i}}\ {{\mathbb{E}}_{\pi^{- i} \sim x^{- i}}\left\lbrack {h^{i}\left( {\pi^{i},\ \pi^{- i}} \right)} \right\rbrack}}} \leq {{\underset{\pi^{i}}{\max}{{\mathbb{E}}_{\pi^{- i} \sim x^{- i}}\left\lbrack {{\overset{\hat{}}{h}}^{i}\left( {\pi^{i},\ \pi^{- i}} \right)} \right\rbrack}} + {\underset{\pi^{i}}{\max}{{\mathbb{E}}_{\pi^{- i} \sim x^{- i}}\left\lbrack {{h^{i}\left( {\pi^{i},\ \pi^{- i}} \right)} -} \right.}}}}} & (99)\end{matrix}$ $\begin{matrix}{\left. {{\overset{\hat{}}{h}}^{i}\left( {\pi^{i},\ \pi^{- i}} \right)} \right\rbrack{{Hence},}} & (100)\end{matrix}$ $\begin{matrix}{\leq {{\max\limits_{\pi^{i}}\ {{\mathbb{E}}_{\pi^{- i} \sim x^{- i}}\left\lbrack {h^{i}\left( {\pi^{i},\pi^{- i}} \right)} \right\rbrack}} - {{\mathbb{E}}_{\pi \sim x}\left\lbrack {h^{i}(\pi)} \right\rbrack}} \leq {\underset{= {0{since}x{is}a{Nash}{Equilibrium}{for}{}{\hat{h}}^{i}}}{\underset{︸}{{\underset{\pi^{i}}{\max}{{\mathbb{E}}_{\pi^{- i} \sim x^{- i}}\left\lbrack {{\hat{h}}^{i}\left( {\pi^{i},\pi^{- i}} \right)} \right\rbrack}} - {{\mathbb{E}}_{\pi \sim x}\left\lbrack {{\hat{h}}^{i}(\pi)} \right\rbrack}}} +}} & (101)\end{matrix}$ $\begin{matrix}{\underset{\leq \in}{\underset{︸}{\underset{\pi^{i}}{\max}{{\mathbb{E}}_{\pi^{- i} \sim x^{- i}}\left\lbrack {{h^{i}\left( {\pi^{i},\pi^{- i}} \right)} - {{\overset{\hat{}}{h}}^{i}\left( {\pi^{i},\pi^{- i}} \right)}} \right\rbrack}}} + \underset{\leq \in}{\underset{︸}{E_{\pi \sim x}\left\lbrack {{{\overset{\hat{}}{h}}^{i}(\pi)} - {h^{i}(\pi)}} \right\rbrack}}} & (102)\end{matrix}$

Hence, if the difference |h^(i)(π)−ĥ^(i)(π)| can be controlled uniformly over players and actions, then an equilibrium for the empirical game is almost an equilibrium for the game defined by the reward function. The question is how many samples n are needed to ensure that a Nash equilibrium for ĥ is a 2ϵ-Nash equilibrium for h for a fixed confidence δ and a fixed ϵ.

In the following, for brevity, player i and the joint strategy π=(π¹, . . . , π^(P)) for p players are fixed, and we write h^(i)=h^(i)(π), ĥ^(i)=ĥ^(i)(π). By Hoeffding's inequality,

$\begin{matrix}{{{\mathbb{P}}\left\lbrack {\left| {{\overset{¯}{R}}^{i} - {{\mathbb{E}}\left\lbrack R^{i} \right\rbrack}} \right| \geq \frac{\epsilon}{2}} \right\rbrack} \leq {2 \cdot {\exp\left( {- \frac{\epsilon^{2}n}{8R^{2}}} \right)}}} & (103)\end{matrix}$

Now, it remains to give a concentration bound for the unbiased estimator of the variance penalty term. Denote

${V_{n}^{2} = {\frac{1}{n - 1}{\sum\limits_{j = 1}^{n}\left( {R_{j}^{i} - {\overset{¯}{R}}^{i}} \right)^{2}}}},$

then

$\mathbb{E}\left\lbrack V_{n}^{2} \right\rbrack = {\mathbb{V}ar}\left\lbrack R^{i} \right\rbrack = \sigma^{2},$ i.e., it is an unbiased estimator of the game variance. The variance of V_(n)² is computed first.
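As a small illustration of the estimators used below, the following sketch computes the sample mean, the unbiased variance estimate V_(n)², and an empirical risk-adjusted payoff of the mean-variance form R̄+β·V_(n)². This form is inferred from Eq. (116); the sketch is illustrative and not the claimed method.

```python
import random

def empirical_risk_adjusted_payoff(returns, beta):
    """Sample mean, unbiased variance V_n^2, and h_hat = mean + beta * V_n^2."""
    n = len(returns)
    mean = sum(returns) / n
    v_n2 = sum((r - mean) ** 2 for r in returns) / (n - 1)  # unbiased estimator
    return mean, v_n2, mean + beta * v_n2

random.seed(0)
simulated_returns = [random.gauss(1.0, 0.5) for _ in range(1000)]  # placeholder returns
print(empirical_risk_adjusted_payoff(simulated_returns, beta=-0.5))
```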

Let $Z_{j}^{i} = R_{j}^{i} - {\mathbb{E}}\left\lbrack R^{i} \right\rbrack$; then $\mathbb{E}\left\lbrack Z^{i} \right\rbrack = 0$ and Z₁^(i), . . . , Z_(n)^(i) are independent. Then

$\begin{matrix}{{{\mathbb{E}}\left\lbrack V_{n}^{2} \right\rbrack} = {{{\mathbb{V}ar}\left\lbrack R^{i} \right\rbrack} = {{{\mathbb{V}ar}\left\lbrack Z^{i} \right\rbrack}.}}} & (104)\end{matrix}$ $\begin{matrix}{{{\mathbb{V}ar}\left\lbrack V_{n}^{2} \right\rbrack} = {{{\mathbb{E}}\left\lbrack V_{n}^{4} \right\rbrack} - \left( {{\mathbb{E}}\left\lbrack V_{n}^{2} \right\rbrack} \right)^{2}}} & (105)\end{matrix}$ $\begin{matrix}{= {{{\mathbb{E}}\left\lbrack \frac{{n^{2}\left( {\sum\limits_{j = 1}^{n}\left( Z_{j}^{i} \right)^{2}} \right)^{2}} - {2n\left( {\sum\limits_{j = 1}^{n}\left( Z_{j}^{i} \right)^{2}} \right)\left( {\sum\limits_{j = 1}^{n}Z_{j}^{i}} \right)^{2}} + \left( {\sum\limits_{j = 1}^{n}Z_{j}^{i}} \right)^{4}}{{n^{2}\left( {n - 1} \right)}^{2}} \right\rbrack} - \sigma^{4}}} & (106)\end{matrix}$ $\begin{matrix}{= {\frac{{n^{2}{\mathbb{E}}\left( {\sum\limits_{j = 1}^{n}\left( Z_{j}^{i} \right)^{2}} \right)^{2}} - {2n{\mathbb{E}}\left\lbrack {\left( {\sum\limits_{j = 1}^{n}\left( Z_{j}^{i} \right)^{2}} \right)\left( {\sum\limits_{j = 1}^{n}Z_{j}^{i}} \right)^{2}} \right\rbrack} + {{\mathbb{E}}\left( {\sum\limits_{j = 1}^{n}Z_{j}^{i}} \right)^{4}}}{{n^{2}\left( {n - 1} \right)}^{2}} - \sigma^{4}}} & (107)\end{matrix}$

Since Z₁^(i), . . . , Z_(n)^(i) are independent (and zero-mean), for distinct j, k, m,

$\mathbb{E}\left\lbrack Z_{j}^{i}Z_{k}^{i} \right\rbrack = 0;\quad \mathbb{E}\left\lbrack \left( Z_{j}^{i} \right)^{3}Z_{k}^{i} \right\rbrack = 0;\quad \mathbb{E}\left\lbrack \left( Z_{j}^{i} \right)^{2}Z_{k}^{i}Z_{m}^{i} \right\rbrack = 0.$  (108)

and denote

$\mathbb{E}\left\lbrack \left( Z_{j}^{i} \right)^{2}\left( Z_{k}^{i} \right)^{2} \right\rbrack = \mu_{2}^{2} = \sigma^{4};\quad \mathbb{E}\left\lbrack \left( Z_{j}^{i} \right)^{4} \right\rbrack = \mu_{4}.$  (109)

Then, with algebraic manipulations, Eq. (105) can be simplified as:

$\begin{matrix}{{{\mathbb{V}ar}\left\lbrack V_{n}^{2} \right\rbrack} = {\frac{{n^{2}\left( {{n\mu_{4}} + {n\left( {n - 1} \right)\mu_{2}^{2}}} \right)} - {2n\left( {{n\mu_{4}} + {n\left( {n - 1} \right)\mu_{2}^{2}}} \right)} + {n\mu_{4}} + {3n\left( {n - 1} \right)\mu_{2}^{2}}}{{n^{2}\left( {n - 1} \right)}^{2}} - \sigma^{4}}} & (110)\end{matrix}$ $\begin{matrix}{= {\frac{{\left( {n - 1} \right)\mu_{4}} + {\left( {n^{2} - {2n} + 3} \right)\sigma^{4}}}{n\left( {n - 1} \right)} - \sigma^{4}}} & (111)\end{matrix}$ $\begin{matrix}{= {\frac{\mu_{4}}{n} - {\frac{\sigma^{4}\left( {n - 3} \right)}{n\left( {n - 1} \right)}.}}} & (112)\end{matrix}$
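Eq. (112) can be spot-checked by simulation. The sketch below is illustrative only; Gaussian returns are used solely because their fourth central moment μ₄=3σ⁴ is known in closed form, in which case Eq. (112) reduces to 2σ⁴/(n−1).

```python
import random
import statistics

def mc_var_of_sample_variance(n, sigma, trials=20000, seed=0):
    """Monte Carlo estimate of Var[V_n^2] for i.i.d. N(0, sigma^2) returns."""
    rng = random.Random(seed)
    v_values = []
    for _ in range(trials):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        v_values.append(statistics.variance(xs))  # unbiased V_n^2
    return statistics.variance(v_values)

n, sigma = 20, 1.0
mu4 = 3 * sigma ** 4  # fourth central moment of a Gaussian
closed_form = mu4 / n - sigma ** 4 * (n - 3) / (n * (n - 1))  # Eq. (112)
print(closed_form, mc_var_of_sample_variance(n, sigma))
```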

By Chebyshev's inequality,

$\begin{matrix}{{{\mathbb{P}}\left\lbrack {\left| {V_{n}^{2} - {{\mathbb{V}ar}\left\lbrack R^{i} \right\rbrack}} \right| \geq \frac{\epsilon}{2\beta}} \right\rbrack} \leq \frac{{\mathbb{V}ar}\left\lbrack V_{n}^{2} \right\rbrack}{\left( \frac{\epsilon}{2\beta} \right)^{2}}} & (113)\end{matrix}$ $\begin{matrix}{\leq \frac{4{\beta^{2}\left( {\frac{\mu_{4}}{n} - \frac{\sigma^{4}\left( {n - 3} \right)}{n\left( {n - 1} \right)}} \right)}}{\epsilon^{2}}} & (114)\end{matrix}$

By Assumption G.4,

μ₄≤16ω²·Γ(2)  (115)

By the triangle inequality,

$\begin{matrix}{{{\mathbb{P}}\left\lbrack {\left| {h^{i} - {\hat{h}}^{i}} \right| \geq \epsilon} \right\rbrack} \leq {{\mathbb{P}}\left\lbrack {{\left| {{{\mathbb{E}}\left\lbrack R^{i} \right\rbrack} - {\overset{\_}{R}}^{i}} \right|} + {\beta \cdot {\left| {V_{n}^{2} - {{\mathbb{V}ar}\left\lbrack R^{i} \right\rbrack}} \right|}} \geq \epsilon} \right\rbrack}} & (116)\end{matrix}$ $\begin{matrix}{\leq {{\mathbb{P}}\left\lbrack {{\left| {{{\mathbb{E}}\left\lbrack R^{i} \right\rbrack} - {\overset{\_}{R}}^{i}} \right|} \geq {\frac{\epsilon}{2}\ \text{or}\ {\beta \cdot {\left| {V_{n}^{2} - {{\mathbb{V}ar}\left\lbrack R^{i} \right\rbrack}} \right|}}} \geq \frac{\epsilon}{2}} \right\rbrack}} & (117)\end{matrix}$ $\begin{matrix}{\leq {{{\mathbb{P}}\left\lbrack {{\left| {{{\mathbb{E}}\left\lbrack R^{i} \right\rbrack} - {\overset{\_}{R}}^{i}} \right|} \geq \frac{\epsilon}{2}} \right\rbrack} + {{\mathbb{P}}\left\lbrack {{\left| {V_{n}^{2} - {{\mathbb{V}ar}\left\lbrack R^{i} \right\rbrack}} \right|} \geq \frac{\epsilon}{2\beta}} \right\rbrack}}} & (118)\end{matrix}$ $\begin{matrix}{\leq {{2 \cdot {\exp\left( {- \frac{\epsilon^{2}n}{8R^{2}}} \right)}} + \frac{4{\beta^{2}\left( {\frac{16{\omega^{2} \cdot {\Gamma(2)}}}{n} - \frac{\sigma^{4}\left( {n - 3} \right)}{n\left( {n - 1} \right)}} \right)}}{\epsilon^{2}}}} & (119)\end{matrix}$ $\begin{matrix}{\leq {{2 \cdot {\exp\left( {- \frac{\epsilon^{2}n}{8R^{2}}} \right)}} + \frac{64\beta^{2}{\omega^{2} \cdot {\Gamma(2)}}}{n\epsilon^{2}}}} & (120)\end{matrix}$ $\begin{matrix}{= {{f\left( {n,\epsilon} \right)}.}} & (121)\end{matrix}$

Therefore, uniformly over joint strategies π and players i, the following bound holds:

$\begin{matrix}{{{\mathbb{P}}\left\lbrack {{\sup\limits_{\pi,i}{\left| {{h^{i}(\pi)} - {{\hat{h}}^{i}(\pi)}} \right|}} < \epsilon} \right\rbrack} \geq \left( {1 - {f\left( {n,\epsilon} \right)}} \right)^{\left| S^{1} \right| \times \ldots \times \left| S^{P} \right| \times p}} & (122)\end{matrix}$

Hence, for

$\begin{matrix}{n \geq {\max\left\{ {{- \frac{8R^{2}}{\epsilon^{2}}{\log\left\lbrack {\frac{1}{4}\left( {1 - \left( {1 - \delta} \right)^{\frac{1}{\left| S^{1} \right| \times \ldots \times \left| S^{p} \right| \times p}}} \right)} \right\rbrack}};\ \frac{64\beta^{2}{\omega^{2} \cdot {\Gamma(2)}}}{\epsilon^{2}\left\lbrack {1 - \left( {1 - \delta} \right)^{\frac{1}{\left| S^{1} \right| \times \ldots \times \left| S^{p} \right| \times p}}} \right\rbrack}} \right\}}} & (123)\end{matrix}$

we have

${{\mathbb{P}}\left\lbrack {{\sup\limits_{\pi,i}{\left| {{h^{i}(\pi)} - {{\hat{h}}^{i}(\pi)}} \right|}} < \epsilon} \right\rbrack} \geq {1 - {\delta.}}$

Plugging this result into Eq. (101), it follows that:

$\begin{matrix}{{{\max\limits_{\pi^{i}}{{\mathbb{E}}_{\pi^{- i}\sim x^{- i}}\left\lbrack {h^{i}\left( {\pi^{i},\pi^{- i}} \right)} \right\rbrack}} - {{\mathbb{E}}_{\pi\sim x}\left\lbrack {h^{i}(\pi)} \right\rbrack}} \leq {2\epsilon}} & (124)\end{matrix}$

Adversarial Variance Reduced Q-Learning

In some embodiments, another Q-Learning algorithm is provided. The system may receive input data including training epochs T; environment env; adversarial action schedule X; exploration rate ϵ; number of models k; epoch length K; recentering sample sizes {N_(m)}_(m≥1); utility function parameter for the protagonist β^(P)<0; and utility function parameter for the adversary β^(A)>0. The training system may initialize Q₀^(P)=0; Q₀^(A)=0; m_(P)=1; and B=0 ∈ ℝ^(|S|×|A|), a visit-count table over state-action pairs.

From t=1 to T, for each value of t: the system chooses agent g from {A, P} according to X; selects action a_(t) according to Q_(m_(g)−1)^(g) by applying an ϵ-greedy strategy; executes the selected action to obtain (s_(e), a_(e), obs, reward, done); and sets RB_(g)=RB_(g)∪{(s_(e), a_(e), obs, reward, done)}.
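A minimal sketch of this control-selection step follows, assuming a Gym-style env.step interface, dictionary-based Q tables, and illustrative helper names (choose_agent, epsilon_greedy) that are not part of the disclosed system.

```python
import random

rng = random.Random(0)
replay_buffers = {"P": [], "A": []}  # RB_P and RB_A

def choose_agent(schedule, t):
    """Pick the acting agent ('P' protagonist or 'A' adversary) from the schedule X."""
    return schedule[t % len(schedule)]

def epsilon_greedy(q_table, state, actions, epsilon):
    """epsilon-greedy selection from a dict-of-dicts Q table."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q_table.get(state, {}).get(a, 0.0))

def run_step(t, env, state, schedule, q_tables, actions, epsilon):
    g = choose_agent(schedule, t)
    a = epsilon_greedy(q_tables[g], state, actions, epsilon)
    obs, reward, done, _info = env.step(a)          # Gym-style 4-tuple assumed
    replay_buffers[g].append((state, a, obs, reward, done))
    return env.reset() if done else obs
```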

From i=1 to N, for each value of i: the system defines

${{{\hat{\mathcal{J}}}_{i}(Q)}\left( {s,a} \right)} = {r + {{\gamma^{n} \cdot \max\limits_{a^{\prime}}}{Q\left( {s^{\prime},a^{\prime}} \right)}}}$

where r is the reward of agent g, e.g., r_(P)(s, a)=r(s, a)+Σ_(j=1)^(n) γ^(j) r(s_(j)^(A), a_(j)^(A)), the a_(j)^(A) are selected according to Q_(m_(A))^(A), s_(j)^(A) and s′ are sampled from the MDP, and the ${\hat{\mathcal{J}}}_{i}$ are empirical Bellman operators constructed using independent samples. The system sets g=P and defines

${{{\overset{\sim}{\mathcal{J}}}_{N}\left( {\overset{\_}{Q}}_{m_{P} - 1}^{P} \right)} = {\frac{1}{N}{\sum_{i \in \mathcal{D}_{N}}{{\hat{\mathcal{J}}}_{i}\left( {\overset{\_}{Q}}_{m_{P} - 1}^{P} \right)}}}},$

where $\mathcal{D}_{N}$ is a collection of N i.i.d. samples (i.e., matrices with samples for each state-action pair (s, a) drawn from RB_(P)), and sets Q₁^(P)=Q_(m_(g)−1)^(P).
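The recentering quantity ${\tilde{\mathcal{J}}}_{N}$ may be read as a plain average of N empirical Bellman backups evaluated at a fixed table. A hedged sketch follows, assuming each sample in $\mathcal{D}_{N}$ is a transition reduced to an (s, a, r, s′) tuple and Q tables are dictionaries keyed by state and then action.

```python
def empirical_bellman(q_table, sample, gamma, n_steps=1):
    """One empirical Bellman backup: r + gamma^n * max_a' Q(s', a')."""
    _s, _a, r, s_next = sample
    best_next = max(q_table.get(s_next, {}).values(), default=0.0)
    return r + (gamma ** n_steps) * best_next

def recentering_estimate(q_bar, samples_per_sa, gamma):
    """J_tilde_N(Q_bar)(s, a): average of N independent empirical backups."""
    estimate = {}
    for (s, a), samples in samples_per_sa.items():
        backups = [empirical_bellman(q_bar, smp, gamma) for smp in samples]
        estimate[(s, a)] = sum(backups) / len(backups)
    return estimate
```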

From k=1, . . . , K, for each value of k: the system computes the step size λ_(k)=1/(1+(1−γ)k) and updates:

$Q_{k + 1}^{(g)}\leftarrow{{\left( {1 - \lambda_{k}} \right) \cdot Q_{k}^{(g)}} + {\lambda_{k} \cdot \left\lbrack {{\hat{\mathcal{J}}\left( Q_{k}^{(g)} \right)} - {\hat{\mathcal{J}}\left( Q_{m_{g} - 1}^{(g)} \right)} + {{\tilde{\mathcal{J}}}_{N}\left( Q_{m_{g} - 1}^{(g)} \right)}} \right\rbrack}},$

where $\hat{\mathcal{J}}$ is an empirical Bellman operator constructed using a sample not in $\mathcal{D}_{N}$; thus the random operators $\hat{\mathcal{J}}$ and ${\tilde{\mathcal{J}}}_{N}$ are independent.
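A compact sketch of the inner-loop variance-reduced update follows. It is illustrative only: it reuses the empirical_bellman and recentering helpers sketched above, assumes q_init has an entry for every state-action pair appearing in the recentering estimate, assumes a sampler that draws, for each (s, a), a fresh transition independent of $\mathcal{D}_{N}$, and uses the step size λ_(k)=1/(1+(1−γ)k) given above.

```python
def variance_reduced_epoch(q_init, q_bar, j_tilde, draw_fresh_sample, gamma, K):
    """K steps of Q_{k+1} = (1-l_k)*Q_k + l_k*[J_hat(Q_k) - J_hat(Q_bar) + J_tilde_N(Q_bar)]."""
    q = {s: dict(av) for s, av in q_init.items()}   # copy so the caller's table is untouched
    for k in range(1, K + 1):
        lam = 1.0 / (1.0 + (1.0 - gamma) * k)       # step size lambda_k
        updates = {}
        for (s, a), j_bar_sa in j_tilde.items():
            sample = draw_fresh_sample(s, a)        # transition independent of D_N
            target = (empirical_bellman(q, sample, gamma)
                      - empirical_bellman(q_bar, sample, gamma)
                      + j_bar_sa)
            updates[(s, a)] = (1.0 - lam) * q[s][a] + lam * target
        for (s, a), value in updates.items():       # synchronous write-back
            q[s][a] = value
    return q
```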

Then the system sets Q_(m_(g))^(g)=Q_(K+1)^(g), m_(g)=m_(g)+1, and g=A.

From k=1, . . . , K, for each value of k: the system computes

$\begin{matrix}{\begin{matrix}{{B\left( {s_{m_{A}},a_{m_{A}}} \right)} = {{B\left( {s_{m_{A}},a_{m_{A}}} \right)} + 1}} \\{\alpha_{m_{A}} = \frac{1}{B\left( {s_{m_{A}},a_{m_{A}}} \right)}} \\{Q_{m_{A}}^{A}\leftarrow{{\left( {1 - \alpha_{m_{A}}} \right) \cdot Q_{m_{A}}^{A}} + {\alpha_{m_{A}} \cdot {\hat{\mathcal{J}}\left( Q_{m_{A}}^{A} \right)}}}}\end{matrix}} & (126)\end{matrix}$

${\max\limits_{a_{P}}{Q^{P}\left( {s,a_{P}} \right)}} = {{\max\limits_{a_{P}}\max\limits_{a_{A}}{Q^{P}\left( {s,a_{P},a_{A}} \right)}} = {\max\limits_{a_{P}}\min\limits_{a_{A}}{Q^{P}\left( {s,a_{P},a_{A}} \right)}}}$
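The adversary's update in Eq. (126) is an ordinary tabular Q-learning step with a visit-count step size. A minimal sketch follows, assuming the same dictionary conventions as above and replay samples reduced to (s, a, r, s′) tuples.

```python
def adversary_update(q_a, visit_counts, sample, gamma):
    """Eq. (126): B(s,a) += 1; alpha = 1/B(s,a); Q_A(s,a) <- (1-alpha)*Q_A(s,a) + alpha*J_hat(Q_A)."""
    s, a, _r, _s_next = sample
    visit_counts[(s, a)] = visit_counts.get((s, a), 0) + 1
    alpha = 1.0 / visit_counts[(s, a)]
    target = empirical_bellman(q_a, sample, gamma)   # r + gamma^n * max_a' Q_A(s', a')
    q_a.setdefault(s, {})
    q_a[s][a] = (1.0 - alpha) * q_a[s].get(a, 0.0) + alpha * target
    return q_a[s][a]
```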

Then policies (π^(P)*, π^(A)*) are obtained:

$J^{(P)}\left( {s,Q^{(P)*},Q^{(A)*}} \right) = {\mathbb{E}}\left\lbrack {{\sum{\gamma^{t} \cdot r_{t}^{(P)}}}\ |\ s,\pi^{(P)*},\pi^{(A)*}} \right\rbrack$  (127)

i.e., for any other policy π^(P),

${\mathbb{E}}\left\lbrack {{\sum{\gamma^{t} \cdot r_{t}^{(P)}}}\ |\ s,\pi^{(P)*},\pi^{(A)*}} \right\rbrack \geq {\mathbb{E}}\left\lbrack {{\sum{\gamma^{t} \cdot r_{t}^{(P)}}}\ |\ s,\pi^{(P)},\pi^{(A)*}} \right\rbrack$  (128)

for any other policy π^(A),

${\mathbb{E}}\left\lbrack {{\sum{\gamma^{t} \cdot r_{t}^{(A)}}}\ |\ s,\pi^{(P)*},\pi^{(A)*}} \right\rbrack \geq {\mathbb{E}}\left\lbrack {{\sum{\gamma^{t} \cdot r_{t}^{(A)}}}\ |\ s,\pi^{(P)*},\pi^{(A)}} \right\rbrack$  (129)
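Once training stops, the policy pair of Eqs. (127)-(129) can be read off the learned tables. The sketch below is illustrative only: it extracts the protagonist's max-min action from a joint table Q^(P)(s, a_P, a_A) and the adversary's greedy action from Q^(A), under the dictionary conventions used above.

```python
def protagonist_maxmin_action(q_p_joint, state):
    """argmax over a_P of min over a_A of Q^P(s, a_P, a_A)."""
    table = q_p_joint[state]                      # dict: a_P -> {a_A: value}
    return max(table, key=lambda a_p: min(table[a_p].values()))

def adversary_greedy_action(q_a, state):
    """argmax over a_A of Q^A(s, a_A)."""
    return max(q_a[state], key=q_a[state].get)
```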

FIG. 13 is a schematic diagram of another example computing device 1300 that implements a system (e.g., the training engine 112 on trade platform 100) for training an automated agent having a neural network, in accordance with an embodiment. As depicted, computing device 1300 includes one or more processors 1302, memory 1304, one or more I/O interfaces 1306, and, optionally, one or more network interfaces 1308.

Each processor 1302 may be, for example, any type of general-purpose microprocessor or microcontroller, a digital signal processing (DSP) processor, an integrated circuit, a field programmable gate array (FPGA), a reconfigurable processor, a programmable read-only memory (PROM), or any combination thereof.

Memory 1304 may include a suitable combination of any type of computer memory that is located either internally or externally such as, for example, random-access memory (RAM), read-only memory (ROM), compact disc read-only memory (CDROM), electro-optical memory, magneto-optical memory, erasable programmable read-only memory (EPROM), electrically-erasable programmable read-only memory (EEPROM), Ferroelectric RAM (FRAM), or the like. Memory 1304 may store code executable at processor 1302, which causes the training system to function in manners disclosed herein. Memory 1304 includes a data storage. In some embodiments, the data storage includes a secure datastore. In some embodiments, the data storage stores received data sets, such as textual data, image data, or other types of data.

Each I/O interface 1306 enables computing device 1300 to interconnect with one or more input devices, such as a keyboard, mouse, camera, touch screen and a microphone, or with one or more output devices such as a display screen and a speaker.

Each network interface 1308 enables computing device 1300 to communicate with other components, to exchange data with other components, to access and connect to network resources, to serve applications, and to perform other computing applications by connecting to a network (or multiple networks) capable of carrying data, including the Internet, Ethernet, plain old telephone service (POTS) line, public switched telephone network (PSTN), integrated services digital network (ISDN), digital subscriber line (DSL), coaxial cable, fiber optics, satellite, mobile, wireless (e.g., Wi-Fi, WiMAX), SS7 signaling network, fixed line, local area network, wide area network, and others, including any combination of these.

The methods disclosed herein may be implemented using a system that includes multiple computing devices 1300. The computing devices 1300 may be the same or different types of devices.

Each computing device may be connected in various ways including directly coupled, indirectly coupled via a network, and distributed over a wide geographic area and connected via a network (which may be referred to as “cloud computing”).

For example, and without limitation, each computing device 1300 may be a server, network appliance, set-top box, embedded device, computer expansion module, personal computer, laptop, personal data assistant, cellular telephone, smartphone device, UMPC tablet, video display terminal, gaming console, electronic reading device, wireless hypermedia device, or any other computing device capable of being configured to carry out the methods described herein.

Embodiments performing the operations for anomaly detection and anomaly scoring provide certain advantages over manually assessing anomalies. For example, in some embodiments, all data points are assessed, which eliminates subjectivity involved in judgement-based sampling, and may provide more statistically significant results than random sampling. Further, the outputs produced by embodiments of the system are reproducible and explainable.

The embodiments of the devices, systems and methods described herein may be implemented in a combination of both hardware and software. These embodiments may be implemented on programmable computers, each computer including at least one processor, a data storage system (including volatile memory or non-volatile memory or other data storage elements or a combination thereof), and at least one communication interface.

Program code is applied to input data to perform the functions described herein and to generate output information. The output information is applied to one or more output devices. In some embodiments, the communication interface may be a network communication interface. In embodiments in which elements may be combined, the communication interface may be a software communication interface, such as those for inter-process communication. In still other embodiments, there may be a combination of communication interfaces implemented as hardware, software, and combination thereof.

Throughout the foregoing discussion, numerous references will be made regarding servers, services, interfaces, portals, platforms, or other systems formed from computing devices. It should be appreciated that the use of such terms is deemed to represent one or more computing devices having at least one processor configured to execute software instructions stored on a computer readable tangible, non-transitory medium. For example, a server can include one or more computers operating as a web server, database server, or other type of computer server in a manner to fulfill described roles, responsibilities, or functions.

The foregoing discussion provides many example embodiments. Although each embodiment represents a single combination of inventive elements, other examples may include all possible combinations of the disclosed elements. Thus if one embodiment comprises elements A, B, and C, and a second embodiment comprises elements B and D, other remaining combinations of A, B, C, or D, may also be used.

The term “connected” or “coupled to” may include both direct coupling (in which two elements that are coupled to each other contact each other) and indirect coupling (in which at least one additional element is located between the two elements).

The technical solution of embodiments may be in the form of a software product. The software product may be stored in a non-volatile or non-transitory storage medium, which can be a compact disk read-only memory (CD-ROM), a USB flash disk, or a removable hard disk. The software product includes a number of instructions that enable a computer device (personal computer, server, or network device) to execute the methods provided by the embodiments.

The embodiments described herein are implemented by physical computer hardware, including computing devices, servers, receivers, transmitters, processors, memory, displays, and networks. The embodiments described herein provide useful physical machines and particularly configured computer hardware arrangements. The embodiments described herein are directed to electronic machines and methods implemented by electronic machines adapted for processing and transforming electromagnetic signals which represent various types of information. The embodiments described herein pervasively and integrally relate to machines and their uses; and the embodiments described herein have no meaning or practical applicability outside their use with computer hardware, machines, and various hardware components. Substituting the physical hardware particularly configured to implement various acts for non-physical hardware, using mental steps for example, may substantially affect the way the embodiments work. Such computer hardware limitations are clearly essential elements of the embodiments described herein, and they cannot be omitted or substituted for mental means without having a material effect on the operation and structure of the embodiments described herein. The computer hardware is essential to implement the various embodiments described herein and is not merely used to perform steps expeditiously and in an efficient manner.

The embodiments and examples described herein are illustrative and non-limiting. Practical implementation of the features may incorporate a combination of some or all of the aspects, and features described herein should not be taken as indications of future or existing product plans. Applicant partakes in both foundational and applied research, and in some cases, the features described are developed on an exploratory basis.

Although the embodiments have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the scope as defined by the appended claims.

Moreover, the scope of the present application is not intended to be limited to the particular embodiments of the process, machine, manufacture, composition of matter, means, methods and steps described in the specification. As one of ordinary skill in the art will readily appreciate from the disclosure of the present invention, processes, machines, manufacture, compositions of matter, means, methods, or steps, presently existing or later to be developed, that perform substantially the same function or achieve substantially the same result as the corresponding embodiments described herein may be utilized. Accordingly, the appended claims are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or steps.


1. A computer-implemented system for training an automated agent, the system comprising: a communication interface; at least one processor; memory in communication with said at least one processor; software code stored in said memory, which when executed at said at least one processor causes said system to: instantiate an automated agent that maintains a reinforcement learning neural network and generates, according to outputs of said reinforcement learning neural network, signals for communicating task requests; receive, by way of said communication interface, a plurality of training input data including a plurality of states and a plurality of actions for the automated agent; initialize a learning table Q for the automated agent based on the plurality of states and the plurality of actions; compute a plurality of updated learning tables based on the initialized learning table Q using a utility function, the utility function comprising a monotonically increasing concave function; and generate an averaged learning table Q′ based on the plurality of updated learning tables.
 2. The computer-implemented system of claim 1, wherein said automated agent is configured to select an action based on the averaged learning table Q′ for communicating one or more task requests.
 3. The computer-implemented system of claim 1, wherein the utility function is represented by u(x)=−e^(βx), β<0.
 4. The computer-implemented system of claim 1, wherein computing a plurality of updated learning tables comprises: receiving, by way of said communication interface, an input parameter k and a training step parameter T; and for each training step t, where t=1, 2 . . . T: computing an interim learning table {circumflex over (Q)} based on the initialized learning table Q; selecting an action a_(t) from the plurality of actions based on the interim learning table {circumflex over (Q)} and a given state s_(t) from the plurality of states; computing a reward r_(t) and a next state s_(t+1) based on the selected action a_(t); and for at least two values of i, where i=1, 2, . . . , k, computing a respective updated learning table Q^(i) of the plurality of updated learning tables based on (s_(t),a_(t),r_(t),s_(t+1)) and the utility function.
 5. The computer-implemented system of claim 4, wherein the averaged learning table Q′ is computed as $\frac{1}{k}{\sum_{i = 1}^{k}{Q^{i}.}}$
 6. The computer-implemented system of claim 1, wherein the utility function is a first utility function, and wherein the software code, when executed at said at least one processor, further causes said system to: instantiate an adversarial agent that maintains an adversarial reinforcement learning neural network and generates, according to outputs of said adversarial reinforcement learning neural network, signals for communicating adversarial task requests; initialize an adversarial learning table Q_(A) for the adversarial agent; compute a plurality of updated adversarial learning tables based on the initialized adversarial learning table Q_(A) using a second utility function, the second utility function comprising a monotonically increasing convex function; and generate an averaged adversarial learning table Q_(A)′ based on the plurality of updated adversarial learning tables.
 7. The computer-implemented system of claim 6, wherein said adversarial agent is configured to select an adversarial action based on the averaged adversarial learning table Q_(A)′ to minimize a reward for the automated agent.
 8. The computer-implemented system of claim 6, wherein the second utility function is represented by u^(A)(x)=−e^(β^(A)x), β^(A)>0.
 9. The computer-implemented system of claim 6, wherein computing a plurality of updated adversarial learning tables comprises: receiving, by way of said communication interface, an input parameter k and a training step parameter T; and for each training step t, where t=1, 2 . . . T: computing an interim adversarial learning table {circumflex over (Q)}_(A) based on the initialized adversarial learning table Q^(A); selecting an adversarial action a_(t)^(A) based on the interim adversarial learning table {circumflex over (Q)}_(A) and a given state s_(t) from the plurality of states; computing an adversarial reward r_(t)^(A) and a next state s_(t+1) based on the selected adversarial action a_(t)^(A); and for at least two values of i, where i=1, 2, . . . , k, computing a respective updated adversarial learning table Q^(i)_(A) of the plurality of updated adversarial learning tables based on (s_(t),a_(t)^(A),r_(t)^(A),s_(t+1)) and the second utility function.
 10. The computer-implemented system of claim 9, wherein the averaged adversarial learning table Q_(A)′ is computed as $\frac{1}{k}{\sum_{i = 1}^{k}{{Q^{i}}_{A}.}}$
 11. A computer-implemented method of training an automated agent, the method comprising: instantiating an automated agent that maintains a reinforcement learning neural network and generates, according to outputs of said reinforcement learning neural network, signals for communicating task requests; receiving, by way of said communication interface, a plurality of training input data including a plurality of states and a plurality of actions for the automated agent; initializing a learning table Q for the automated agent based on the plurality of states and the plurality of actions; computing a plurality of updated learning tables based on the initialized learning table Q using a utility function, the utility function comprising a monotonically increasing concave function; and generating an averaged learning table Q′ based on the plurality of updated learning tables.
 12. The method of claim 11, further comprising selecting an action, by the automated agent, based on the averaged learning table Q′ for communicating one or more task requests.
 13. The method of claim 11, wherein the utility function is represented by u(x)=−e^(βx), β<0.
 14. The method of claim 11, wherein computing a plurality of updated learning tables comprises: receiving, by way of said communication interface, an input parameter k and a training step parameter T; and for each training step t, where t=1, 2 . . . T: computing an interim learning table {circumflex over (Q)} based on the initialized learning table Q; selecting an action a_(t) from the plurality of actions based on the interim learning table {circumflex over (Q)} and a given state s_(t) from the plurality of states; computing a reward r_(t) and a next state s_(t+1) based on the selected action a_(t); and for at least two values of i, where i=1, 2, . . . , k, computing a respective updated learning table Q^(i) of the plurality of updated learning tables based on (s_(t),a_(t),r_(t),s_(t+1)) and the utility function.
 15. The method of claim 14, wherein the averaged learning table Q′ is computed as $\frac{1}{k}{\sum_{i = 1}^{k}{Q^{i}.}}$
 16. The method of claim 11, wherein the utility function is a first utility function and further comprising: instantiating an adversarial agent that maintains an adversarial reinforcement learning neural network and generates, according to outputs of said adversarial reinforcement learning neural network, signals for communicating adversarial task requests; initializing an adversarial learning table Q_(A) for the adversarial agent; computing a plurality of updated adversarial learning tables based on the initialized adversarial learning table Q_(A) using a second utility function, the second utility function comprising a monotonically increasing convex function; and generating an averaged adversarial learning table Q_(A)′ based on the plurality of updated adversarial learning tables.
 17. The method of claim 16, further comprising selecting an adversarial action, by the adversarial agent, based on the averaged adversarial learning table Q_(A)′ to minimize a reward for the automated agent.
 18. The method of claim 16, wherein the second utility function is represented by u^(A)(x)=−e^(β^(A)x), β^(A)>0.
 19. The method of claim 16, wherein computing a plurality of updated adversarial learning tables comprises: receiving, by way of said communication interface, an input parameter k and a training step parameter T; and for each training step t, where t=1, 2 . . . T: computing an interim adversarial learning table {circumflex over (Q)}^(A) based on the initialized adversarial learning table Q^(A); selecting an adversarial action a_(t)^(A) based on the interim adversarial learning table {circumflex over (Q)}^(A) and a given state s_(t) from the plurality of states; computing an adversarial reward r_(t)^(A) and a next state s_(t+1) based on the selected adversarial action a_(t)^(A); and for at least two values of i, where i=1, 2, . . . , k, computing a respective updated adversarial learning table Q^(i)_(A) of the plurality of updated adversarial learning tables based on (s_(t),a_(t)^(A),r_(t)^(A),s_(t+1)) and the second utility function.
 20. The method of claim 19, wherein the averaged adversarial learning table Q_(A)′ is computed as $\frac{1}{k}{\sum_{i = 1}^{k}{{Q^{i}}_{A}.}}$